Abstract

The k-means method is an iterative clustering algorithm which associates each observation with one of k clusters.
It traditionally employs cluster centers in the same space as the observed data. By relaxing this requirement, it
is possible to apply the k-means method to infinite dimensional problems, for example multiple target tracking and
smoothing problems in the presence of unknown data association. Via a Γ-convergence argument, the associated
optimization problem is shown to converge in the sense that both the k-means minimum and minimizers converge
in the large data limit to quantities which depend upon the observed data only through its distribution. The theory
is supplemented with two examples to demonstrate the range of problems now accessible by the k-means method.
The first example combines a non-parametric smoothing problem with unknown data association. The second
addresses tracking using sparse data from a network of passive sensors.

1 Introduction 

The k-means algorithm [23] is a technique for assigning each of a collection of observed data to exactly one of k
clusters, each of which has a unique center, in such a way that each observation is assigned to the cluster whose
center is closest to that observation in an appropriate sense.

The k-means method has traditionally been used with limited scope. Its usual application has been in Euclidean
spaces, which restricts it to finite dimensional problems. There are relatively few theoretical results
using the k-means methodology in infinite dimensions, of which El 8 l~i l 191421 1 1281 are the only papers known
to the authors. In the right framework, post-hoc track estimation in multiple target scenarios with unknown data
association can be viewed as a clustering problem and is therefore accessible to the k-means method. In such problems
one typically has finite-dimensional data but wishes to estimate infinite dimensional tracks, with the added
complication of unresolved data association. It is our aim to propose and characterize a framework for the k-means
method which can deal with this problem.

A natural question to ask of any clustering technique is whether the estimated clustering stabilizes as more data 
becomes available. More precisely, we ask whether certain estimates converge, in an appropriate sense, in the large 
data limit. In order to answer this question in our particular context we first establish a related optimization problem 
and make precise the notion of convergence. 

Consistency of estimators for ill-posed inverse problems has been well studied, for example mm. but without
the data association problem. In contrast to standard statistical consistency results, we do not assume that there
exists a structural relationship between the optimization problem and the data-generating process in order to establish
convergence to true parameter values in the large data limit; rather, we demonstrate convergence to the solution
of a related limiting problem.

This paper shows the convergence of the minimization problem associated with the k-means method in a framework
that is general enough to include examples where the cluster centers are not necessarily in the same space
as the data points. In particular we are motivated by the application to infinite dimensional problems, e.g. the
smoothing-data association problem. The smoothing-data association problem is the problem of associating data
points $\{(t_i, z_i)\}_{i=1}^n \subset [0,1] \times \mathbb{R}^\kappa$ to unknown trajectories
$\mu_j : [0,1] \to \mathbb{R}^\kappa$ for $j = 1, 2, \dots, k$. By treating the trajectories $\mu_j$ as the cluster
centers one may approach this problem using the k-means methodology. The comparison of data points to cluster centers
is a pointwise distance: $d((t_i, z_i), \mu_j) = |\mu_j(t_i) - z_i|^2$ (where $|\cdot|$ is the Euclidean
norm on $\mathbb{R}^\kappa$). To ensure the problem is well-posed some regularization is also necessary. For $k = 1$
the problem reduces to smoothing and coincides with the limiting problem studied in [17]. We will discuss the
smoothing-data association problem more in Section 4.3.
Let us now introduce the notation for our variational approach. The k-means method is a strategy for partitioning
a data set $\Psi_n = \{\xi_i\}_{i=1}^n \subset X$ into $k$ clusters where each cluster has center $\mu_j$ for
$j = 1, 2, \dots, k$. First let us consider the special case when $\mu_j \in X$. The data partition is defined by
associating each data point with the cluster center closest to it, where closeness is measured by a cost function
$d : X \times X \to [0, \infty)$. Traditionally the k-means method considers Euclidean spaces $X = \mathbb{R}^\kappa$,
where typically we choose $d(x, y) = |x - y|^2 = \sum_{i=1}^{\kappa} (x_i - y_i)^2$. We define the energy for a
choice of cluster centers given data by

\[
  f_n(\mu \mid \Psi_n) = \frac{1}{n} \sum_{i=1}^{n} \bigwedge_{j=1}^{k} d(\xi_i, \mu_j)
\]

where for any $k$ variables, $a_1, a_2, \dots, a_k$, $\bigwedge_{j=1}^{k} a_j := \min\{a_1, \dots, a_k\}$. The optimal
choice of $\mu$ is that which minimizes $f_n(\cdot \mid \Psi_n)$. We define

\[
  \theta_n = \min_{\mu \in X^k} f_n(\mu \mid \Psi_n) \in \mathbb{R}.
\]
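The energy just defined is straightforward to evaluate for finite dimensional data. The following is a minimal sketch, assuming the Euclidean setting $X = \mathbb{R}^\kappa$ with squared-distance cost; the function names `squared_euclidean` and `kmeans_energy` and the toy data are illustrative choices, not notation from the paper.

```python
from typing import Sequence

Point = Sequence[float]


def squared_euclidean(x: Point, y: Point) -> float:
    """Cost d(x, y) = |x - y|^2 on R^kappa."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def kmeans_energy(data: Sequence[Point], centers: Sequence[Point]) -> float:
    """f_n(mu | Psi_n) = (1/n) * sum_i min_j d(xi_i, mu_j)."""
    if not data or not centers:
        raise ValueError("need at least one data point and one center")
    total = 0.0
    for xi in data:
        # the wedge over j is the minimum over the cluster centers
        total += min(squared_euclidean(xi, mu) for mu in centers)
    return total / len(data)


# Hypothetical toy data: two well separated pairs of points in the plane.
data = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
centers = [(0.0, 0.5), (4.0, 0.5)]
energy = kmeans_energy(data, centers)
```

Evaluating the energy does not by itself find the minimizing centers; computing $\theta_n$ additionally requires a search over $\mu \in X^k$.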

An associated "limiting problem" can be defined

\[
  \theta = \min_{\mu \in X^k} f_\infty(\mu)
\]

where we assume, in a sense which will be made precise later, that $\xi_i \sim P$ for some suitable probability
distribution, $P$, and define

\[
  f_\infty(\mu) = \int \bigwedge_{j=1}^{k} d(x, \mu_j) \, P(dx).
\]

In Section 3 we validate the formulation by first showing that, under regularity conditions and with probability one,
the minimum energy converges: $\theta_n \to \theta$; and secondly by showing that (up to a subsequence) the minimizers
converge: $\mu^n \to \mu^\infty$ where $\mu^n$ minimizes $f_n$ and $\mu^\infty$ minimizes $f_\infty$ (again with
probability one).

In a more sophisticated version of the k-means method the requirement that $\mu_j \in X$ can be relaxed. We
instead allow $\mu = (\mu_1, \mu_2, \dots, \mu_k) \in Y^k$ for some other Banach space, $Y$, and define $d$
appropriately. This leads to interesting statistical questions. When $Y$ is infinite dimensional even establishing
whether or not a minimizer exists is non-trivial.

When the cluster center is in a different space to the data, bounding the set of minimizers becomes less natural. 
For example, consider the smoothing problem in which one wishes to fit a continuous function to a set of data 
points. The natural choice of cost function is a pointwise distance of the data to the curve. The optimal solution is 
for the cluster center to interpolate the data points: in the limit the cluster center may no longer be well defined. In 
particular we cannot hope to have converging sequences of minimizers. 

In the smoothing literature this problem is prevented by using a regularization term $r : Y^k \to \mathbb{R}$. For a
cost function $d : X \times Y \to [0, \infty)$ the energies $f_n(\cdot \mid \Psi_n), f_\infty(\cdot) : Y^k \to \mathbb{R}$
are redefined as

\[
  f_n(\mu \mid \Psi_n) = \frac{1}{n} \sum_{i=1}^{n} \bigwedge_{j=1}^{k} d(\xi_i, \mu_j) + \lambda_n r(\mu),
\]

\[
  f_\infty(\mu) = \int \bigwedge_{j=1}^{k} d(x, \mu_j) \, P(dx) + \lambda r(\mu).
\]

Adding regularization changes the nature of the problem so we commit time in Section 4 to justifying our approach.
In particular we motivate treating $\lambda_n = \lambda$ as a constant independent of $n$. We are able to repeat the
analysis from Section 3; that is, to establish that the minimum and a subsequence of minimizers still converge.

Early results assumed $Y = X$ were Euclidean spaces and showed the convergence of minimizers to the appropriate
limit [18, 25]. The motivation for the early work in this area was to show consistency of the methodology. In
particular this requires there to be an underlying 'truth'. This requires the assumption that there exists a unique
minimizer to the limiting energy. These results do not hold when the limiting energy has more than one minimizer [4].
In this paper we discuss only the convergence of the method and as such require no assumption as to the existence
or uniqueness of a minimizer to the limiting problem. Consistency has been strengthened to a central limit theorem
in [26], also assuming a unique minimizer to the limiting energy. Other rates of convergence have been shown
in [2, 3, 9, 22]. In Hilbert spaces there exist convergence results and rates of convergence for the minimum. In O
the authors show that $|f_n(\mu^n) - f_\infty(\mu^\infty)|$ is of order $\frac{1}{\sqrt{n}}$; however, there are no
results for the convergence of minimizers. Results exist for $k \to \infty$, see for example {§] (which are also
valid for $Y \neq X$).

Assuming that $Y = X$, the convergence of the minimization problem in a reflexive and separable Banach space
has been proved in OTl and a similar result in metric spaces in [20]. In [19], the existence of a weakly converging
subsequence was inferred using the results of Ell.

In the following section we introduce the notation and preliminary material used in this paper.

We then, in Section 3, consider convergence in the special case when the cluster centers are in the same space
as the data points, i.e. $Y = X$. In this case we do not have an issue with well-posedness as the data has the same
dimension as the cluster centers. For this reason we use energies defined without regularization. Theorem 3.5 shows
that the minimum converges, i.e. $\theta_n \to \theta$ as $n \to \infty$, for almost every sequence of observations
and furthermore we have a subsequence $\mu^{n_m}$ of minimizers of $f_{n_m}$ which weakly converge to some
$\mu^\infty$ which minimizes $f_\infty$.

This result is generalized in Section 4 to an arbitrary $X$ and $Y$. The analogous result to Theorem 3.5 is
Theorem 4.6. We first motivate the problem and in particular our choice of scaling in the regularization in Section 4.1
before proceeding to the results in Section 4.2. Verifying the conditions on the cost function $d$ and regularization
term $r$ is non-trivial and so we show an application to the smoothing-data association problem in Section 4.3.

To demonstrate the generality of the results in this paper, two applications are considered in Section 5. The first
is the data association and smoothing problem. We show the minimum converging as the data size increases. We
also numerically investigate the use of the k-means energy to determine whether two targets have crossed tracks.
The second example uses measured times of arrival and amplitudes of signals from moving sources that are received
across a network of three sensors. The cluster centers are the source trajectories in $\mathbb{R}^2$.

2 Preliminaries 

In this section we introduce some notation and background theory which will be used in Sections 3 and 4 to establish
our convergence results. In these sections we show the existence of optimal cluster centers using the direct method.
By imposing conditions, such that our energies are weakly lower semi-continuous, we can deduce the existence
of minimizers. Further conditions ensure the minimizers are uniformly bounded. The Γ-convergence framework
(e.g. 033) allows us to establish the convergence of the minimum and also the convergence of minimizers.

We have the following definition of Γ-convergence with respect to weak convergence.

Definition 2.1 (Γ-convergence). A sequence $f_n : A \to \mathbb{R} \cup \{\pm\infty\}$ on a Banach space
$(A, \|\cdot\|_A)$ is said to Γ-converge on the domain $A$ to $f_\infty : A \to \mathbb{R} \cup \{\pm\infty\}$ with
respect to weak convergence on $A$, and we write $f_\infty = \Gamma\text{-}\lim_n f_n$, if for all $x \in A$ we have

(i) (liminf inequality) for every sequence $(x_n)$ weakly converging to $x$

\[
  f_\infty(x) \le \liminf_{n} f_n(x_n);
\]

(ii) (recovery sequence) there exists a sequence $(x_n)$ weakly converging to $x$ such that

\[
  f_\infty(x) \ge \limsup_{n} f_n(x_n).
\]

When it exists the Γ-limit is always weakly lower semi-continuous, and thus admits minimizers. An important
property of Γ-convergence is that it implies the convergence of minimizers. In particular, we will make extensive
use of the following well-known result.
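As a small illustration of Definition 2.1, consider a standard textbook example which is not taken from this paper: oscillating functions on $A = \mathbb{R}$, where weak and strong convergence coincide. It shows that the Γ-limit can exist even when the pointwise limit does not.

```latex
\[
  f_n(x) = \sin(nx), \quad x \in \mathbb{R},
  \qquad\text{has}\qquad
  \Gamma\text{-}\lim_{n} f_n = f_\infty \equiv -1 .
\]
% (i) liminf inequality: f_n \ge -1 everywhere, so for any x_n \to x
\[
  \liminf_{n} f_n(x_n) \ge -1 = f_\infty(x).
\]
% (ii) recovery sequence: let m_n be an integer nearest to (nx + \pi/2)/(2\pi) and set
\[
  x_n = \frac{2\pi m_n - \pi/2}{n},
  \qquad |x_n - x| \le \frac{\pi}{n} \to 0,
  \qquad f_n(x_n) = \sin\!\left(2\pi m_n - \tfrac{\pi}{2}\right) = -1 ,
\]
% so that \limsup_n f_n(x_n) = -1 = f_\infty(x).
```

In this example $\inf_{\mathbb{R}} f_n = -1 = \inf_{[0, 2\pi]} f_n$ for every $n \ge 1$, so the minima are attained on a fixed compact set and converge to $\min f_\infty = -1$.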

Theorem 2.1 (Convergence of Minimizers). Let $f_n : A \to \mathbb{R}$ be a sequence of functionals on a Banach space
$(A, \|\cdot\|_A)$ and assume that there exists $N > 0$ and a weakly compact subset $K \subset A$ with

\[
  \inf_{A} f_n = \inf_{K} f_n \quad \forall n > N.
\]

If $f_\infty = \Gamma\text{-}\lim_n f_n$ and $f_\infty$ is not identically $\pm\infty$ then

\[
  \min_{A} f_\infty = \lim_{n \to \infty} \inf_{A} f_n.
\]

Furthermore if each $f_n$ is weakly lower semi-continuous then for each $f_n$ there exists a minimizer $x_n \in K$
and any weak limit point of $x_n$ minimizes $f_\infty$. Since $K$ is weakly compact there exists at least one weak
limit point.

A proof of the theorem can be found in [6, Theorem 1.21].

The problems which we address involve random observations. We assume throughout the existence of a probability
space $(\Omega, \mathcal{F}, \mathbb{P})$, rich enough to support a countably infinite sequence of such observations,
$\xi_1^{(\omega)}, \xi_2^{(\omega)}, \dots$. All random elements are defined upon this common probability space and
all stochastic quantifiers are to be understood as acting with respect to $\mathbb{P}$ unless otherwise stated. Where
appropriate, to emphasize the randomness of the functionals $f_n$, we will write $f_n^{(\omega)}$ to indicate the
functional associated with the particular observation sequence $\xi_1^{(\omega)}, \dots, \xi_n^{(\omega)}$, and we
allow $P_n^{(\omega)}$ to denote the associated empirical measure.
We define the support of a (probability) measure to be the smallest closed set such that the complement is null.
For clarity we often write integrals using operator notation, i.e. for a measure $P$, which is usually a probability
distribution, we write

\[
  Ph = \int h(x) \, P(dx).
\]

For a sequence of probability distributions, $P_n$, we say that $P_n$ converges weakly to $P$ if

\[
  P_n h \to P h \quad \text{for all bounded and continuous } h
\]

and we write $P_n \Rightarrow P$. With a slight abuse of notation we will sometimes write $P(U) := P \mathbb{I}_U$
for a measurable set $U$.
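The operator notation can be mirrored directly in code when $P$ is replaced by the empirical measure of a sample, in which case $P_n h$ is simply a sample average. The sketch below assumes real-valued observations drawn from Uniform[0, 1]; the helper names `empirical_integral` and `empirical_mass` are hypothetical and chosen only for this illustration.

```python
import random
from typing import Callable, Sequence


def empirical_integral(h: Callable[[float], float], sample: Sequence[float]) -> float:
    """P_n h = (1/n) * sum_i h(xi_i)."""
    return sum(h(x) for x in sample) / len(sample)


def empirical_mass(lower: float, upper: float, sample: Sequence[float]) -> float:
    """P_n(U) = P_n 1_U for the interval U = [lower, upper)."""
    return empirical_integral(lambda x: 1.0 if lower <= x < upper else 0.0, sample)


rng = random.Random(0)  # fixed seed so the run is reproducible
sample = [rng.random() for _ in range(50_000)]  # xi_i ~ Uniform[0, 1]

# For P = Uniform[0, 1]: P h = 1/3 when h(x) = x^2, and P([0.2, 0.5)) = 0.3.
second_moment = empirical_integral(lambda x: x * x, sample)
mass = empirical_mass(0.2, 0.5, sample)
```

By the strong law of large numbers both sample averages approach their population values as the sample grows, which is the mechanism used repeatedly in Section 3.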

For a Banach space $A$ one can define the dual space $A^*$ to be the space of all bounded and linear maps over
$A$ into $\mathbb{R}$ equipped with the norm $\|F\|_{A^*} = \sup_{x \in A, \|x\|_A \le 1} |F(x)|$. Similarly one can
define the second dual $A^{**}$ as the space of all bounded and linear maps over $A^*$ into $\mathbb{R}$. Reflexive
spaces are defined to be spaces $A$ such that $A$ is isometrically isomorphic to $A^{**}$. These have the useful
property that closed and bounded sets are weakly compact. For example any $L^p$ space (with $1 < p < \infty$) is
reflexive, as is any Hilbert space (by the Riesz Representation Theorem: if $A$ is a Hilbert space then $A^*$ is
isometrically isomorphic to $A$).

A sequence $x_n \in A$ is said to converge weakly to $x \in A$ if $F(x_n) \to F(x)$ for all $F \in A^*$. We write
$x_n \rightharpoonup x$. We say a functional $G : A \to \mathbb{R}$ is weakly continuous if $G(x_n) \to G(x)$ whenever
$x_n \rightharpoonup x$ and strongly continuous if $G(x_n) \to G(x)$ whenever $\|x_n - x\|_A \to 0$. Note that weak
continuity implies strong continuity. Similarly a functional $G$ is weakly lower semi-continuous if
$\liminf_{n \to \infty} G(x_n) \ge G(x)$ whenever $x_n \rightharpoonup x$.

We define the Sobolev spaces $W^{s,p}(I)$ on $I \subset \mathbb{R}$ by

\[
  W^{s,p} = W^{s,p}(I) = \left\{ f : I \to \mathbb{R} \text{ s.t. } \partial^i f \in L^p(I) \text{ for } i = 0, \dots, s \right\}
\]

where we use $\partial$ for the weak derivative, i.e. $g = \partial f$ if for all $\phi \in C_c^\infty(I)$ (the space
of smooth functions with compact support)

\[
  \int_I f(x) \frac{d\phi}{dx}(x) \, dx = - \int_I g(x) \phi(x) \, dx.
\]

In particular, we will use the special case when $p = 2$ and we write $H^s = W^{s,2}$. This is a Hilbert space with norm:

\[
  \|f\|_{H^s}^2 = \sum_{i=0}^{s} \|\partial^i f\|_{L^2}^2.
\]

For two real-valued and positive sequences $a_n$ and $b_n$ we write $a_n \lesssim b_n$ if $\frac{a_n}{b_n}$ is
bounded. For a space $A$ and a set $K \subset A$ we write $K^c$ for the complement of $K$ in $A$, i.e.
$K^c = A \setminus K$.

3 Convergence when Y = X 

We assume we are given data points $\xi_i \in X$ for $i = 1, 2, \dots$ where $X$ is a reflexive and separable Banach
space with norm $\|\cdot\|_X$ and Borel σ-algebra $\mathcal{X}$. These data points realize a sequence of
$\mathcal{X}$-measurable random elements on $(\Omega, \mathcal{F}, \mathbb{P})$ which will also be denoted, with a
slight abuse of notation, $\xi_i$.

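We define

\[
  f_n^{(\omega)} : X^k \to \mathbb{R}, \qquad
  f_n^{(\omega)}(\mu) = P_n^{(\omega)} g_\mu = \frac{1}{n} \sum_{i=1}^{n} \bigwedge_{j=1}^{k} d(\xi_i^{(\omega)}, \mu_j),
  \tag{1}
\]

\[
  f_\infty : X^k \to \mathbb{R}, \qquad
  f_\infty(\mu) = P g_\mu = \int_X \bigwedge_{j=1}^{k} d(x, \mu_j) \, P(dx),
  \tag{2}
\]

where

\[
  g_\mu(x) = \bigwedge_{j=1}^{k} d(x, \mu_j),
\]

$P$ is a probability measure on $(X, \mathcal{X})$, and the empirical measure $P_n^{(\omega)}$ associated with
$\xi_1^{(\omega)}, \dots, \xi_n^{(\omega)}$ is defined by

\[
  P_n^{(\omega)} h = \frac{1}{n} \sum_{i=1}^{n} h(\xi_i^{(\omega)})
\]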
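for any $\mathcal{X}$-measurable function $h : X \to \mathbb{R}$. We assume $\xi_i$ are iid according to $P$ with
$P = \mathbb{P} \circ \xi_i^{-1}$.

We wish to show

\[
  \theta_n^{(\omega)} \to \theta \quad \text{for almost every } \omega \text{ as } n \to \infty
  \tag{3}
\]

where

\[
  \theta_n^{(\omega)} = \inf_{\mu \in X^k} f_n^{(\omega)}(\mu), \qquad
  \theta = \inf_{\mu \in X^k} f_\infty(\mu).
\]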


We define $\|\cdot\|_k : X^k \to [0, \infty)$ by

\[
  \|\mu\|_k := \max_{j} \|\mu_j\|_X \quad \text{for } \mu = (\mu_1, \mu_2, \dots, \mu_k) \in X^k.
  \tag{4}
\]

The reflexivity of $(X, \|\cdot\|_X)$ carries through to $(X^k, \|\cdot\|_k)$.

Our strategy is similar to that of [25] but we embed the methodology into the Γ-convergence framework. We
show that (2) is the Γ-limit in Theorem 3.2 and that minimizers are bounded in Proposition 3.3. We may then apply
Theorem 2.1 to infer (3) and the existence of a weakly converging subsequence of minimizers.

The key assumptions on $d$ and $P$ are given in Assumptions 1. The first assumption can be understood as a
'closeness' condition for the space $X$ with respect to $d$. If we let $d(x, y) = 1$ for $x \neq y$ and $d(x, x) = 0$
then our cost function $d$ does not carry any information on how far apart two points are. Assume there exists a
probability density for $P$ which has unbounded support. Then $f_n^{(\omega)}(\mu) \ge \frac{n-k}{n}$ (for almost
every $\omega$), with equality when we choose $\mu_j \in \{\xi_i^{(\omega)}\}_{i=1}^{n}$, i.e. any set of $k$ unique
data points will minimize $f_n^{(\omega)}$. Since our data points are unbounded we may find a sequence
$\|\xi_{i_n}^{(\omega)}\|_X \to \infty$. Now we choose $\mu_1^n = \xi_{i_n}^{(\omega)}$ and clearly our cluster center
is unbounded. We see that this choice of $d$ violates the first assumption. We also add a moment condition to the upper
bound to ensure integrability. Note that this also implies that
$P d(\cdot, 0) \le \int_X M(\|x\|_X) \, P(dx) < \infty$ so $f_\infty(0) < \infty$ and, in particular, that $f_\infty$
is not identically infinity.
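The discrete-cost example above can be checked numerically. The following sketch assumes one dimensional Gaussian data (so that $P$ has a density with unbounded support); the names `discrete_cost` and `energy` are illustrative only. Any $k$ distinct data points attain the value $(n-k)/n$, including the $k$ points of largest magnitude, which is why the minimizers need not stay bounded.

```python
import random
from typing import Sequence


def discrete_cost(x: float, y: float) -> float:
    """d(x, y) = 1 if x != y and d(x, x) = 0."""
    return 0.0 if x == y else 1.0


def energy(data: Sequence[float], centers: Sequence[float]) -> float:
    """f_n(mu) = (1/n) * sum_i min_j d(xi_i, mu_j) for the discrete cost."""
    return sum(min(discrete_cost(x, mu) for mu in centers) for x in data) / len(data)


rng = random.Random(1)  # fixed seed for reproducibility
n, k = 200, 3
data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # density with unbounded support

centers_first = data[:k]                     # any k distinct data points
centers_extreme = sorted(data, key=abs)[-k:]  # the k data points furthest from the origin
centers_off_data = [0.123, 0.456, 0.789]      # centers that miss every data point

lower_bound = (n - k) / n
```

Both `centers_first` and `centers_extreme` achieve `lower_bound`, so the energy cannot distinguish a sensible choice of centers from one that drifts off with the most extreme observations.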

The second assumption is a slightly stronger condition on $d$ than strong continuity in the first variable and weak
lower semi-continuity in the second variable. The condition allows the application of Fatou's lemma
for weakly converging probabilities, see El.

The third assumption allows us to view $d(\xi_i, y)$ as a collection of random variables. The fourth implies that we
have at least $k$ open balls with positive probability and therefore we are not overfitting clusters to data.

Assumptions 1. We have the following assumptions on $d : X \times X \to [0, \infty)$ and $P$.

1.1. There exist continuous, strictly increasing functions $m, M : [0, \infty) \to [0, \infty)$ such that

\[
  m(\|x - y\|_X) \le d(x, y) \le M(\|x - y\|_X) \quad \text{for all } x, y \in X
\]

with $\lim_{r \to \infty} m(r) = \infty$, $M(0) = 0$, there exists $\gamma < \infty$ such that
$M(\|x + y\|_X) \le \gamma M(\|x\|_X) + \gamma M(\|y\|_X)$ and finally $\int_X M(\|x\|_X) \, P(dx) < \infty$
(and $M$ is measurable).

1.2. For each $x, y \in X$ we have that if $x_m \to x$ and $y_n \rightharpoonup y$ as $n, m \to \infty$ then

\[
  \liminf_{n, m \to \infty} d(x_m, y_n) \ge d(x, y) \quad \text{and} \quad \lim_{m \to \infty} d(x_m, y) = d(x, y).
\]

1.3. For each $y \in X$ we have that $d(\cdot, y)$ is $\mathcal{X}$-measurable.

1.4. There exist $k$ different centers $\mu_j^\dagger \in X$, $j = 1, 2, \dots, k$, such that for all $\delta > 0$

\[
  P(B(\mu_j^\dagger, \delta)) > 0 \quad \forall j = 1, 2, \dots, k
\]

where $B(\mu, \delta) := \{x \in X : \|\mu - x\|_X < \delta\}$.

We now show that for a particular common choice of cost function, $d$, Assumptions 1.1 to 1.3 hold.

Remark 3.1. For any $p > 0$ let $d(x, y) = \|x - y\|_X^p$ then $d$ satisfies Assumptions 1.1 to 1.3.

Proof. Taking $m(r) = M(r) = r^p$ we can bound $m(\|x - y\|_X) \le d(x, y) \le M(\|x - y\|_X)$ and $m$, $M$ clearly
satisfy $m(r) \to \infty$, $M(0) = 0$, are strictly increasing and continuous. One can also show that

\[
  M(\|x + y\|_X) \le \max\{1, 2^{p-1}\} \left( \|x\|_X^p + \|y\|_X^p \right)
\]

(the factor $2^{p-1}$ is needed for $p \ge 1$, while for $p < 1$ the constant 1 suffices), hence Assumption 1.1 is
satisfied.
Let $x_m \to x$ and $y_n \rightharpoonup y$. Then

\begin{align*}
  \liminf_{n, m \to \infty} d(x_m, y_n)^{1/p}
    &= \liminf_{n, m \to \infty} \|x_m - y_n\|_X \\
    &\ge \liminf_{n, m \to \infty} \left( \|y_n - x\|_X - \|x_m - x\|_X \right) \\
    &= \liminf_{n \to \infty} \|y_n - x\|_X \quad \text{since } x_m \to x \\
    &\ge \|y - x\|_X
\end{align*}

where the last inequality follows as a consequence of the Hahn-Banach Theorem and the fact that
$y_n - x \rightharpoonup y - x$ which implies $\liminf_{n \to \infty} \|y_n - x\|_X \ge \|y - x\|_X$. Clearly
$d(x_m, y) \to d(x, y)$ and so Assumption 1.2 holds.
The third assumption holds by the Borel measurability of metrics on complete separable metric spaces. □
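The growth condition in Assumption 1.1 for $M(r) = r^p$ reduces to the scalar inequality $(a + b)^p \le \gamma (a^p + b^p)$ for $a, b \ge 0$, since $\|x + y\|_X \le \|x\|_X + \|y\|_X$ and $M$ is increasing. The sketch below is a numerical sanity check of this inequality, assuming the constant $\gamma = \max\{1, 2^{p-1}\}$; the helper names are hypothetical and the random search is an illustration, not a proof.

```python
import random


def gamma_for_power(p: float) -> float:
    """A constant gamma with (a + b)^p <= gamma * (a^p + b^p) for all a, b >= 0."""
    return max(1.0, 2.0 ** (p - 1.0))


def worst_ratio(p: float, trials: int = 20_000, seed: int = 0) -> float:
    """Largest observed value of (a + b)^p / (a^p + b^p) over random a, b in [0, 10]."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        a, b = rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0)
        denom = a ** p + b ** p
        if denom > 0.0:
            worst = max(worst, (a + b) ** p / denom)
    return worst


for p in (0.5, 1.0, 2.0, 3.0):
    print(p, worst_ratio(p), gamma_for_power(p))
```

For $p < 1$ the observed ratio exceeds $2^{p-1}$, which is why the constant is taken as the maximum of 1 and $2^{p-1}$ in this sketch.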

We now state the first result of the paper which formalizes the understanding that $f_\infty$ is the limit of
$f_n^{(\omega)}$.

Theorem 3.2. Let $(X, \|\cdot\|_X)$ be a reflexive and separable Banach space with Borel σ-algebra, $\mathcal{X}$;
let $\{\xi_i\}_{i \in \mathbb{N}}$ be a sequence of independent $X$-valued random elements with common law $P$. Assume
$d : X \times X \to [0, \infty)$ and that $P$ satisfies the conditions in Assumptions 1. Define
$f_n^{(\omega)} : X^k \to \mathbb{R}$ and $f_\infty : X^k \to \mathbb{R}$ by (1) and (2) respectively. Then

\[
  f_\infty = \Gamma\text{-}\lim_{n} f_n^{(\omega)}
\]

for $\mathbb{P}$-almost every $\omega$.

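Proof. Define $\Omega'$ as the intersection of three events:

\begin{align*}
  \Omega' = {} & \left\{ \omega \in \Omega : P_n^{(\omega)} \Rightarrow P \right\}
    \cap \left\{ \omega \in \Omega : P_n^{(\omega)}(B(0, q)^c) \to P(B(0, q)^c) \ \forall q \in \mathbb{N} \right\} \\
  & \cap \left\{ \omega \in \Omega : \int_X \mathbb{I}_{B(0,q)^c}(x) M(\|x\|_X) \, P_n^{(\omega)}(dx)
      \to \int_X \mathbb{I}_{B(0,q)^c}(x) M(\|x\|_X) \, P(dx) \ \forall q \in \mathbb{N} \right\}.
\end{align*}

By the almost sure weak convergence of the empirical measure the first of these events has probability one, the
second and third are characterized by the convergence of a countable collection of empirical averages to their
population average and, by the strong law of large numbers, each has probability one. Hence $\mathbb{P}(\Omega') = 1$.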

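Fix $\omega \in \Omega'$: we will show that the lim inf inequality holds and a recovery sequence exists for this
$\omega$ and hence for every $\omega \in \Omega'$. We start by showing the lim inf inequality, allowing
$(\mu^n)_{n=1}^{\infty} \subset X^k$ to denote any sequence which converges weakly to $\mu \in X^k$. We are required
to show:

\[
  \liminf_{n \to \infty} f_n^{(\omega)}(\mu^n) \ge f_\infty(\mu).
\]

By Theorem 1.1 in [15] we have

\[
  \int_X \liminf_{n \to \infty, x' \to x} g_{\mu^n}(x') \, P(dx)
    \le \liminf_{n \to \infty} \int_X g_{\mu^n}(x) \, P_n^{(\omega)}(dx)
    = \liminf_{n \to \infty} P_n^{(\omega)} g_{\mu^n}.
\]

For each $x \in X$, we have by Assumption 1.2 that

\[
  \liminf_{x' \to x, n \to \infty} d(x', \mu_j^n) \ge d(x, \mu_j).
\]

By taking the minimum over $j$ we have

\[
  \liminf_{x' \to x, n \to \infty} g_{\mu^n}(x')
    = \bigwedge_{j=1}^{k} \liminf_{x' \to x, n \to \infty} d(x', \mu_j^n)
    \ge \bigwedge_{j=1}^{k} d(x, \mu_j) = g_\mu(x).
\]

Hence

\[
  \liminf_{n \to \infty} f_n^{(\omega)}(\mu^n)
    = \liminf_{n \to \infty} P_n^{(\omega)} g_{\mu^n}
    \ge \int_X g_\mu(x) \, P(dx) = f_\infty(\mu)
\]

as required.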

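We now establish the existence of a recovery sequence for every $\omega \in \Omega'$ and every $\mu \in X^k$. Let
$\mu^n = \mu \in X^k$. Let $\zeta_q$ be a $C^\infty(X)$ sequence of functions such that $0 \le \zeta_q(x) \le 1$ for all
$x \in X$, $\zeta_q(x) = 1$ for $x \in B(0, q-1)$ and $\zeta_q(x) = 0$ for $x \notin B(0, q)$. Then the function
$\zeta_q(x) g_\mu(x)$ is continuous in $x$ (and with respect to convergence in $\|\cdot\|_X$) for all $q$. We also have

\begin{align*}
  \zeta_q(x) g_\mu(x) &\le \zeta_q(x) d(x, \mu_1) \\
    &\le \zeta_q(x) M(\|x - \mu_1\|_X) \\
    &\le \zeta_q(x) M(\|x\|_X + \|\mu_1\|_X) \\
    &\le M(q + \|\mu_1\|_X)
\end{align*}

so $\zeta_q g_\mu$ is a continuous and bounded function, hence by the weak convergence of $P_n^{(\omega)}$ to $P$ we have

\[
  P_n^{(\omega)} \zeta_q g_\mu \to P \zeta_q g_\mu
\]

as $n \to \infty$ for all $q \in \mathbb{N}$. For all $q \in \mathbb{N}$ we have

\begin{align*}
  \limsup_{n \to \infty} |P_n^{(\omega)} g_\mu - P g_\mu|
    &\le \limsup_{n \to \infty} |P_n^{(\omega)} g_\mu - P_n^{(\omega)} \zeta_q g_\mu|
      + \limsup_{n \to \infty} |P_n^{(\omega)} \zeta_q g_\mu - P \zeta_q g_\mu|
      + \limsup_{n \to \infty} |P \zeta_q g_\mu - P g_\mu| \\
    &= \limsup_{n \to \infty} |P_n^{(\omega)} g_\mu - P_n^{(\omega)} \zeta_q g_\mu| + |P \zeta_q g_\mu - P g_\mu|.
\end{align*}

Therefore,

\[
  \limsup_{n \to \infty} |P_n^{(\omega)} g_\mu - P g_\mu|
    \le \liminf_{q \to \infty} \limsup_{n \to \infty} |P_n^{(\omega)} g_\mu - P_n^{(\omega)} \zeta_q g_\mu|
\]

by the dominated convergence theorem. We now show that the right hand side of the above expression is equal to
zero. We have

\begin{align*}
  |P_n^{(\omega)} g_\mu - P_n^{(\omega)} \zeta_q g_\mu|
    &\le P_n^{(\omega)} \mathbb{I}_{(B(0, q-1))^c} g_\mu \\
    &\le P_n^{(\omega)} \mathbb{I}_{(B(0, q-1))^c} d(\cdot, \mu_1) \\
    &\le P_n^{(\omega)} \mathbb{I}_{(B(0, q-1))^c} M(\|\cdot - \mu_1\|_X) \\
    &\le \gamma \left( P_n^{(\omega)} \mathbb{I}_{(B(0, q-1))^c} M(\|\cdot\|_X)
        + M(\|\mu_1\|_X) P_n^{(\omega)} \mathbb{I}_{(B(0, q-1))^c} \right) \\
    &\to \gamma \left( P \mathbb{I}_{(B(0, q-1))^c} M(\|\cdot\|_X)
        + M(\|\mu_1\|_X) P \mathbb{I}_{(B(0, q-1))^c} \right) \quad \text{as } n \to \infty \\
    &\to 0 \quad \text{as } q \to \infty
\end{align*}

where the last limit follows by the monotone convergence theorem. We have shown

\[
  \lim_{n \to \infty} |P_n^{(\omega)} g_\mu - P g_\mu| = 0.
\]

Hence

\[
  f_n^{(\omega)}(\mu) \to f_\infty(\mu)
\]

as required. □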

Now we have established almost sure Γ-convergence we establish the boundedness condition in Proposition 3.3
so we can apply Theorem 2.1.

Proposition 3.3. Assuming the conditions of Theorem 3.2 and defining $\|\cdot\|_k$ by (4), there exists $R > 0$ such that

\[
  \inf_{\mu \in X^k} f_n^{(\omega)}(\mu) = \inf_{\|\mu\|_k \le R} f_n^{(\omega)}(\mu) \quad \forall n \text{ sufficiently large}
\]

for $\mathbb{P}$-almost every $\omega$. In particular $R$ is independent of $n$.

Proof. The structure of the proof is similar to [20, Lemma 2.1]. We argue by contradiction. In particular we
argue that if a cluster center is unbounded then in the limit the minimum is achieved over the remaining $k - 1$
cluster centers. We then use Assumption 1.4 to imply that adding an extra cluster center will strictly decrease the
minimum, and hence we have a contradiction.

We define $\Omega''$ to be

\[
  \Omega'' = \bigcap_{\delta \in \mathbb{Q} \cap (0, \infty), \, l = 1, 2, \dots, k}
    \left\{ \omega \in \Omega' : P_n^{(\omega)}(B(\mu_l^\dagger, \delta)) \to P(B(\mu_l^\dagger, \delta)) \right\}.
\]

As $\Omega''$ is the countable intersection of sets of probability one, we have $\mathbb{P}(\Omega'') = 1$. Fix
$\omega \in \Omega''$ and assume that the cluster centers $\mu^n \in X^k$ are almost minimizers, i.e.

\[
  f_n^{(\omega)}(\mu^n) \le \inf_{\mu \in X^k} f_n^{(\omega)}(\mu) + \epsilon_n
\]
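for some sequence $\epsilon_n > 0$ such that

\[
  \lim_{n \to \infty} \epsilon_n = 0.
  \tag{5}
\]

Assume that $\lim_{n \to \infty} \|\mu^n\|_k = \infty$. There exists $l_n \in \{1, \dots, k\}$ such that
$\lim_{n \to \infty} \|\mu_{l_n}^n\|_X = \infty$. Fix $x \in X$ then

\[
  d(x, \mu_{l_n}^n) \ge m(\|\mu_{l_n}^n - x\|_X) \to \infty.
\]

Therefore, for each $x \in X$,

\[
  \lim_{n \to \infty} \left( \bigwedge_{j=1}^{k} d(x, \mu_j^n) - \bigwedge_{j \neq l_n} d(x, \mu_j^n) \right) = 0.
\]

Let $\delta > 0$ then there exists $N$ such that for $n \ge N$

\[
  \bigwedge_{j=1}^{k} d(x, \mu_j^n) - \bigwedge_{j \neq l_n} d(x, \mu_j^n) \ge -\delta.
\]

Hence

\[
  \liminf_{n \to \infty} \int \left( \bigwedge_{j=1}^{k} d(x, \mu_j^n) - \bigwedge_{j \neq l_n} d(x, \mu_j^n) \right)
    P_n^{(\omega)}(dx) \ge -\delta.
\]

Letting $\delta \to 0$ we have

\[
  \liminf_{n \to \infty} \int \left( \bigwedge_{j=1}^{k} d(x, \mu_j^n) - \bigwedge_{j \neq l_n} d(x, \mu_j^n) \right)
    P_n^{(\omega)}(dx) \ge 0
\]

and moreover

\[
  \liminf_{n \to \infty} \left( f_n^{(\omega)}(\mu^n) - f_n^{(\omega)}\left( (\mu_j^n)_{j \neq l_n} \right) \right) \ge 0,
  \tag{6}
\]

where we interpret $f_n^{(\omega)}$ accordingly. It suffices to demonstrate that

\[
  \liminf_{n \to \infty} \left( \inf_{\mu \in X^k} f_n^{(\omega)}(\mu) - \inf_{\mu \in X^{k-1}} f_n^{(\omega)}(\mu) \right) < 0.
  \tag{7}
\]

Indeed, if (7) holds, then

\begin{align*}
  \liminf_{n \to \infty} \left( f_n^{(\omega)}(\mu^n) - f_n^{(\omega)}\left( (\mu_j^n)_{j \neq l_n} \right) \right)
    &= \lim_{n \to \infty} \left( f_n^{(\omega)}(\mu^n) - \inf_{\mu \in X^k} f_n^{(\omega)}(\mu) \right)
      + \liminf_{n \to \infty} \left( \inf_{\mu \in X^k} f_n^{(\omega)}(\mu)
        - f_n^{(\omega)}\left( (\mu_j^n)_{j \neq l_n} \right) \right) \\
    &< 0 \quad \text{by (5) and (7)},
\end{align*}

where the first bracket on the right hand side is bounded above by $\epsilon_n$, but this contradicts (6).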

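We now establish (7). By Assumption 1.4 there exist $k$ centers $\mu_j^\dagger \in X$ and $\delta_1 > 0$ such that
$\min_{j \neq l} \|\mu_j^\dagger - \mu_l^\dagger\|_X \ge \delta_1$. Hence for any $\mu \in X^{k-1}$ there exists
$l \in \{1, 2, \dots, k\}$ such that we have

\[
  \|\mu_l^\dagger - \mu_j\|_X \ge \frac{\delta_1}{2} \quad \text{for } j = 1, 2, \dots, k-1.
\]

Proceeding with this choice of $l$, for $x \in B(\mu_l^\dagger, \delta_2)$ (for any $\delta_2 \in (0, \delta_1/2)$) we have

\[
  \|\mu_j - x\|_X \ge \frac{\delta_1}{2} - \delta_2
\]

and therefore $d(x, \mu_j) \ge m\left( \frac{\delta_1}{2} - \delta_2 \right)$ for all $j = 1, 2, \dots, k-1$. Also

\[
  D_l(\mu) := \min_{j = 1, 2, \dots, k-1} d(x, \mu_j) - d(x, \mu_l^\dagger)
    \ge m\left( \frac{\delta_1}{2} - \delta_2 \right) - M(\delta_2).
  \tag{8}
\]

So for $\delta_2$ sufficiently small there exists $\epsilon > 0$ such that

\[
  D_l(\mu) \ge \epsilon.
\]

Since the right hand side is independent of $\mu \in X^{k-1}$,

\[
  \inf_{\mu \in X^{k-1}} \max_{l} D_l(\mu) \ge \epsilon.
\]

Define the characteristic function

\[
  \chi_\mu(\xi) =
  \begin{cases}
    1 & \text{if } \|\xi - \mu_{l(\mu)}^\dagger\|_X < \delta_2 \\
    0 & \text{otherwise,}
  \end{cases}
\]

where $l(\mu)$ is the maximizer in (8). For each $\omega \in \Omega''$ one obtains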


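\begin{align*}
  \inf_{\mu \in X^{k-1}} f_n^{(\omega)}(\mu)
    &= \inf_{\mu \in X^{k-1}} \frac{1}{n} \sum_{i=1}^{n} \bigwedge_{j=1}^{k-1} d(\xi_i, \mu_j) \\
    &\ge \inf_{\mu \in X^{k-1}} \frac{1}{n} \sum_{i=1}^{n} \left[ \bigwedge_{j=1}^{k-1} d(\xi_i, \mu_j) (1 - \chi_\mu(\xi_i))
        + \left( d(\xi_i, \mu_{l(\mu)}^\dagger) + \epsilon \right) \chi_\mu(\xi_i) \right] \\
    &\ge \inf_{\mu \in X^k} f_n^{(\omega)}(\mu) + \epsilon \min_{l = 1, 2, \dots, k} P_n^{(\omega)}(B(\mu_l^\dagger, \delta_2)).
\end{align*}

Then since $P_n^{(\omega)}(B(\mu_l^\dagger, \delta_2)) \to P(B(\mu_l^\dagger, \delta_2)) > 0$ by Assumption 1.4 (for
$\delta_2 \in \mathbb{Q} \cap (0, \infty)$) we can conclude (7) holds. □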


Remark 3.4. One can easily show that Assumption 1.2 implies that $d$ is weakly lower semi-continuous in its second
argument which carries through to $f_n^{(\omega)}$. It follows that on any bounded (or equivalently as $X$ is
reflexive: weakly compact) set the infimum of $f_n^{(\omega)}$ is achieved. Hence the infimum in Proposition 3.3 is
actually a minimum.

We now easily prove convergence by application of Theorem 2.1.

Theorem 3.5. Assuming the conditions of Theorem 3.2 and Proposition 3.3 the minimization problem associated
with the k-means method converges. I.e. for $\mathbb{P}$-almost every $\omega$:

\[
  \min_{\mu \in X^k} f_\infty(\mu) = \lim_{n \to \infty} \min_{\mu \in X^k} f_n^{(\omega)}(\mu).
\]

Furthermore any sequence of minimizers $\mu^n$ of $f_n^{(\omega)}$ is almost surely weakly precompact and any weak
limit point minimizes $f_\infty$.
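The convergence of the minimum in Theorem 3.5 can be observed numerically in a case where both sides are computable. The sketch below assumes $X = \mathbb{R}$, $k = 2$, $d(x, y) = |x - y|^2$ and $P$ uniform on $[0, 1]$, for which the limiting minimum is $1/48$, attained at the centers $(1/4, 3/4)$. The function name `exact_two_means_minimum` is a hypothetical helper; it relies on the one dimensional fact that an optimal partition splits the sorted sample into two contiguous blocks.

```python
import random
from typing import List


def exact_two_means_minimum(sample: List[float]) -> float:
    """theta_n = min over (mu_1, mu_2) of (1/n) * sum_i min_j (xi_i - mu_j)^2 on the real line.

    Each candidate partition is a left and a right block of the sorted sample,
    each block centered at its mean; scanning all n - 1 splits gives the minimum.
    """
    xs = sorted(sample)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")
    total_sum = sum(xs)
    total_sq = sum(x * x for x in xs)
    best = float("inf")
    left_sum = 0.0
    left_sq = 0.0
    for s in range(1, n):
        left_sum += xs[s - 1]
        left_sq += xs[s - 1] ** 2
        right_sum = total_sum - left_sum
        right_sq = total_sq - left_sq
        # within-block sum of squared deviations from the block mean
        sse = (left_sq - left_sum ** 2 / s) + (right_sq - right_sum ** 2 / (n - s))
        best = min(best, sse)
    return best / n


theta_limit = 1.0 / 48.0  # min of f_infinity for P = Uniform[0, 1] and k = 2
rng = random.Random(0)    # fixed seed for reproducibility
for n in (100, 1_000, 10_000, 100_000):
    sample = [rng.random() for _ in range(n)]
    print(n, abs(exact_two_means_minimum(sample) - theta_limit))
```

The printed gaps give an empirical view of $|\theta_n - \theta|$ for one realization of the data; they illustrate, but do not replace, the almost sure statement of the theorem.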


4 The Case of General Y 

In the previous section the data, $\xi_i$, and cluster centers, $\mu_j$, took their values in a common space, $X$.
We now remove this restriction and let $\xi_i : \Omega \to X$ and $\mu_j \in Y$. We may want to use this framework to
deal with finite dimensional data and infinite dimensional cluster centers, which can lead to the variational problem
having uninformative minimizers.

In the previous section the cost function $d$ was assumed to scale with the underlying norm. This is no longer
appropriate when $d : X \times Y \to [0, \infty)$. In particular if we consider the smoothing-data association problem
then the natural choice of $d$ is a pointwise distance which will lead to the optimal cluster centers interpolating
data points. Hence, in any $H^s$ norm with $s \ge 1$, the optimal cluster centers "blow up".

One possible solution would be to weaken the space to $L^2$ and allow this type of behavior. This is undesirable
from both modeling and mathematical perspectives: If we first consider the modeling point of view then we do not 
expect our estimate to perfectly fit the data which is observed in the presence of noise. It is natural that the cluster 
centers are smoother than the data alone would suggest. It is desirable that the optimal clusters should reflect reality. 
From the mathematical point of view, restricting ourselves to only very weak spaces gives no hope of obtaining a 
strongly convergent subsequence. 

An alternative approach is, as is common in the smoothing literature, to use a regularization term. This approach
is also standard when dealing with ill-posed inverse problems. This changes the nature of the problem and so
requires some justification. In particular the scaling of the regularization with the data is of fundamental importance.
In the following section we argue that scaling motivated by a simple Bayesian interpretation of the problem is
not strong enough (unsurprisingly, countable collections of finite dimensional observations do not carry enough
information to provide consistency when dealing with infinite dimensional parameters). In the form of a simple
example we show that the optimal cluster center is unbounded in the large data limit when the regularization goes
to zero sufficiently quickly. The natural scaling in this example is for the regularization to vary with the number
of observations as $n^p$ for $p \in [- , 0]$. We consider the case $p = 0$ in Section 4.2. This type of
regularization is understood as penalized likelihood estimation m.

Although it may seem undesirable for the limiting problem to depend upon the regularization it is unavoidable
in ill-posed problems such as this one: there is not sufficient information, in even countably infinite collections of
observations, to recover the unknown cluster centers, and exploiting known (or expected) regularity in these solutions
provides one way to combine observations with qualitative prior beliefs about the cluster centers in a principled
manner. There are many precedents for this approach, including ira in which the consistency of penalized splines
is studied using what in this paper we call the Γ-limit. In that paper a fixed regularization was used to define
the limiting problem in order to derive an estimator. Naturally, regularization strong enough to alter the limiting
problem influences the solution and we cannot hope to obtain consistent estimation in this setting, even in settings
in which the cost function can be interpreted as the log likelihood of the data generating process. In the setting
of ma, the regularization is finally scaled to zero whereupon under assumptions the estimator converges to the
truth but such a step is not feasible in the more complicated settings considered here.

When more structure is available it may be desirable to further investigate the regularization. For example with
$k = 1$ the non-parametric regression model is equivalent to the white noise model G) for which optimal scaling of
the regularization is known mm. It is the subject of further work to extend these results to $k > 1$.

With our redefined k-means type problem we can replicate the results of the previous section, and do so in
Theorem 4.6. That is, we prove that the k-means method converges where $Y$ is a general separable and reflexive
Banach space and in particular need not be equal to $X$.

This section is split into three subsections. In the first we motivate the regularization term. The second contains 
the convergence theory in a general setting. Establishing that the assumptions of this subsection hold is non-trivial 
and so, in the third subsection, we show an application to the smoothing-data association problem. 


4.1 Regularization 

In this section we use a toy, k = 1, smoothing problem to motivate an approach to regularization which is adopted 
in what follows. We assume that the cluster centers are periodic with equally spaced observations so we may use a 
Fourier argument. In particular we work on the space of 1-periodic functions in $H^2$, 

\[ Y = \left\{ \mu : [0,1] \to \mathbb{R}^d \ \text{s.t.}\ \mu(0) = \mu(1) \text{ and } \mu \in H^2 \right\}. \tag{9} \]

For arbitrary sequences $(a_n)$, $(b_n)$ and data $\Psi_n = \{(t_j, z_j)\}_{j=1}^n \subset [0,1] \times \mathbb{R}^d$ we define the functional 

\[ f_n^{(\omega)}(\mu) = a_n \sum_{j=0}^{n-1} |\mu(t_j) - z_j|^2 + b_n \|\partial^2 \mu\|_{L^2}^2. \tag{10} \]


Data are points in space-time: $[0,1] \times \mathbb{R}^d$. The regularization is chosen so that it penalizes the $L^2$ norm of the second 
derivative. For simplicity, we employ deterministic measurement times $t_j$ in the following proposition although 
this lies outside the formal framework which we consider subsequently. Another simplification we make is to use 
convergence in expectation rather than almost sure convergence. This simplifies our arguments. We stress that this 
section is the motivation for the problem studied in Section 4.2. We will give conditions on the scaling of $a_n$ and 
$b_n$ that determine whether $\mathbb{E} \min f_n^{(\omega)}$ and $\mathbb{E}\|\mu^n\|_{H^2}$ stay bounded, where $\mu^n$ is the minimizer of $f_n^{(\omega)}$. 

Proposition 4.1. Let data be given by $\Psi_n = \{(t_j, z_j)\}_{j=1}^n$ with $t_j = \frac{j}{n}$ under the assumption $z_j = \mu^\dagger(t_j) + \epsilon_j$ 
for $\epsilon_j$ iid noise with finite variance and $\mu^\dagger \in L^2$, and define $Y$ by (9). Then $\inf_{\mu \in Y} f_n^{(\omega)}(\mu)$ defined by (10) stays 
bounded (in expectation) if $a_n = O\left(\frac{1}{n}\right)$ for any positive sequence $b_n$. 


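Proof. Assume $n$ is odd. Both $\mu$ and $z$ are 1-periodic so we can write 

\[ \mu(t) = \frac{1}{n} \sum_{l=-\frac{n-1}{2}}^{\frac{n-1}{2}} \hat{\mu}_l e^{2\pi i l t} \quad\text{and}\quad z_j = \frac{1}{n} \sum_{l=-\frac{n-1}{2}}^{\frac{n-1}{2}} \hat{z}_l e^{\frac{2\pi i l j}{n}} \]

with 

\[ \hat{\mu}_l = \sum_{j=0}^{n-1} \mu(t_j) e^{-\frac{2\pi i l j}{n}} \quad\text{and}\quad \hat{z}_l = \sum_{j=0}^{n-1} z_j e^{-\frac{2\pi i l j}{n}}. \]

We will continue to use the notation that $\hat{\mu}_l$ is the Fourier transform of $\mu$. We write 

\[ \hat{\mu} = \left( \hat{\mu}_{-\frac{n-1}{2}}, \hat{\mu}_{-\frac{n-1}{2}+1}, \dots, \hat{\mu}_{\frac{n-1}{2}} \right). \]

Similarly for $z$. 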




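Substituting the Fourier expansion of $\mu$ and $z$ into $f_n^{(\omega)}$ implies 

\[ f_n^{(\omega)}(\mu) = \frac{a_n}{n} \left( \langle \hat{\mu}, \hat{\mu} \rangle - 2 \langle \hat{\mu}, \hat{z} \rangle + \langle \hat{z}, \hat{z} \rangle + \frac{\gamma_n}{n} \langle l^4 \hat{\mu}, \hat{\mu} \rangle \right) \]

where $\gamma_n = \frac{16\pi^4 b_n}{a_n}$ and $\langle x, z \rangle = \sum_l x_l \bar{z}_l$. The Gateaux derivative $\partial f_n^{(\omega)}(\mu; \nu)$ of $f_n^{(\omega)}$ at $\mu$ in the direction $\nu$ is 

\[ \partial f_n^{(\omega)}(\mu; \nu) = \frac{2 a_n}{n} \left\langle \hat{\mu} - \hat{z} + \frac{\gamma_n}{n} l^4 \hat{\mu}, \hat{\nu} \right\rangle, \]

which implies the minimizer $\mu^n$ of $f_n^{(\omega)}$ is (in terms of its Fourier expansion) 

\[ \hat{\mu}^n = \left( 1 + \frac{\gamma_n l^4}{n} \right)^{-1} \hat{z} := \left( \left( 1 + \frac{\gamma_n l^4}{n} \right)^{-1} \hat{z}_l \right)_{l=-\frac{n-1}{2}}^{\frac{n-1}{2}}. \]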


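It follows that the minimum is 

\[ \mathbb{E}\left( f_n^{(\omega)}(\mu^n) \right) = \frac{a_n}{n} \mathbb{E} \left\langle \left( 1 - \left( 1 + \frac{\gamma_n l^4}{n} \right)^{-1} \right) \hat{z}, \hat{z} \right\rangle \le a_n \mathbb{E} \sum_{j=0}^{n-1} z_j^2 \le 2 a_n n \left( \|\mu^\dagger\|_{L^2}^2 + \mathrm{Var}(\epsilon) \right). \]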


Similar expressions can be obtained for the case of even $n$. □ 

Clearly the natural choice for $a_n$ is 

\[ a_n = \frac{1}{n} \]

which we use from here. We let $b_n = \lambda n^p$ and therefore $\gamma_n = 16\pi^4 \lambda n^{p+1}$. From Proposition 4.1 we immediately 
have that $\mathbb{E} \min f_n^{(\omega)}$ is bounded for any choice of $p$. In our next proposition we show that for $p \in [-\frac{4}{5}, 0]$ our minimizer 
is bounded in $H^2$ whilst outside this window the norm either blows up or the second derivative converges to zero. 
For simplicity in the calculations we impose the further condition that $\mu^\dagger(t) = 0$. 
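
The closed-form Fourier coefficients of the minimizer derived in the proof above lend themselves to a direct numerical check. The following is a minimal sketch rather than the authors' code: it assumes scalar data ($d = 1$), odd $n$, $a_n = 1/n$ and $b_n = \lambda n^p$, and the names `centered_dft` and `periodic_smoother` are hypothetical. Each coefficient $\hat{z}_l$ is shrunk by the factor $(1 + 16\pi^4 \lambda n^p l^4)^{-1}$ and the resulting trigonometric polynomial is evaluated.

```python
import cmath
import math


def centered_dft(values):
    """O(n^2) DFT with centered frequencies l = -(n-1)/2, ..., (n-1)/2 (n odd)."""
    n = len(values)
    half = (n - 1) // 2
    return {
        l: sum(v * cmath.exp(-2j * math.pi * l * j / n) for j, v in enumerate(values))
        for l in range(-half, half + 1)
    }


def periodic_smoother(z, lam, p):
    """Return t -> mu^n(t), the minimizer for a_n = 1/n and b_n = lam * n**p.

    z[j] is the observation at t_j = j / n; n must be odd in this sketch.
    """
    n = len(z)
    if n % 2 == 0:
        raise ValueError("this sketch only treats odd n")
    z_hat = centered_dft(z)
    shrink = 16 * math.pi ** 4 * lam * n ** p  # equals gamma_n / n
    mu_hat = {l: z_l / (1 + shrink * l ** 4) for l, z_l in z_hat.items()}

    def mu(t):
        # Inverse transform; the imaginary part vanishes for real data.
        total = sum(c * cmath.exp(2j * math.pi * l * t) for l, c in mu_hat.items())
        return total.real / n

    return mu
```

With `lam=0` the sketch interpolates the data, while a very large `lam` leaves essentially only the sample mean, in line with the remark that strong regularization drives the second derivative to zero.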


Proposition 4.2. In addition to the assumptions of Proposition 4.1 let $a_n = \frac{1}{n}$, $b_n = \lambda n^p$, $\epsilon_j \sim N(0, \sigma^2)$ and 
assume that $\mu^n$ is the minimizer of $f_n^{(\omega)}$. 

1. For $n$ sufficiently large there exists $M_1 > 0$ such that for all $p$ and $n$ the $L^2$ norm is bounded: 

\[ \mathbb{E}\|\mu^n\|_{L^2}^2 \le M_1. \]

2. If $p > 0$ then 

\[ \mathbb{E}\|\partial^2 \mu^n\|_{L^2}^2 \to 0 \quad\text{as } n \to \infty. \]

If we further assume that $\mu^\dagger(t) = 0$, then the following statements are true. 

3. For all $p \in [-\frac{4}{5}, 0]$ there exists $M_2 > 0$ such that 

\[ \mathbb{E}\|\partial^2 \mu^n\|_{L^2}^2 \le M_2. \]

4. If $p < -\frac{4}{5}$ then 

\[ \mathbb{E}\|\partial^2 \mu^n\|_{L^2}^2 \to \infty \quad\text{as } n \to \infty. \]
Proof. The first two statements follow from 

\[ \mathbb{E}\|\mu^n\|_{L^2}^2 \le 2\left( \|\mu^\dagger\|_{L^2}^2 + \mathrm{Var}(\epsilon) \right) \]

E||5V||i2<^(||/rt|| 2 2+Var(e)) 

7 n 

which are easily shown. Statement 3 is shown after statement 4. 

Following the calculation in the proof of Proposition 4.1 and assuming that $\mu^\dagger(t) = 0$, it is easily shown that 

\[ \mathbb{E}\|\partial^2 \mu^n\|_{L^2}^2 = \frac{16\pi^4 \sigma^2}{n} \sum_{l=-\frac{n-1}{2}}^{\frac{n-1}{2}} \frac{l^4}{(1 + 16\pi^4 \lambda n^p l^4)^2} =: S(n) \tag{11} \]

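since $\mathbb{E}|\hat{z}_l|^2 = \sigma^2 n$. To show $S(n) \to \infty$ we will manipulate the Riemann sum approximation of 

\[ \int_{-\frac{1}{2}}^{\frac{1}{2}} \frac{x^4}{(1 + 16\pi^4 \lambda x^4)^2} \, dx = C \]

where $0 < C < \infty$. We have 

\[ \int_{-\frac{1}{2}}^{\frac{1}{2}} \frac{x^4}{(1 + 16\pi^4 \lambda x^4)^2} \, dx = n^{1+\frac{p}{4}} \int_{-\frac{1}{2} n^{-1-\frac{p}{4}}}^{\frac{1}{2} n^{-1-\frac{p}{4}}} \frac{n^{4+p} w^4}{(1 + 16\pi^4 \lambda n^{4+p} w^4)^2} \, dw \quad\text{where } x = n^{1+\frac{p}{4}} w \]

\[ \approx n^{\frac{5p}{4}} \sum_{l=-\lfloor \frac{1}{2} n^{-\frac{p}{4}} \rfloor}^{\lfloor \frac{1}{2} n^{-\frac{p}{4}} \rfloor} \frac{l^4}{(1 + 16\pi^4 \lambda n^p l^4)^2} =: R(n). \]

Therefore assuming $p > -4$ we have 

\[ S(n) \ge \frac{16\pi^4 \sigma^2}{n^{1+\frac{5p}{4}}} R(n). \]

So for $1 + \frac{5p}{4} < 0$ we have $S(n) \to \infty$. Since $S(n)$ is monotonic in $p$ then $S(n) \to \infty$ for all $p < -\frac{4}{5}$. This shows 
that statement 4 is true. 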


Finally we establish the third statement. If $p = -\frac{4}{5}$ then 


S(n) = I67 T 4 a 2 R(n) + 


167T 4 <7 2 


E 


i 4 




1 (1 + 167r 4 An p ( 4 ) 2 (1 + 167r 4 An p ) 4 ) 2 


2 7 4 

y _ L 

^ (1 + 16tH 


\ 


i= L^J+i 




< 167rV 2 f?(n) H- r 


27r 4 cr 


4_2 


ns (1 + 7T 4 A) 2 


The remaining cases $p \in [-\frac{4}{5}, 0]$ are a consequence of (11), which implies that $p \mapsto \mathbb{E}\|\partial^2 \mu^n\|_{L^2}^2$ is non-increasing. □ 
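
The scaling window can also be explored numerically by evaluating the sum in (11) directly. The sketch below is an illustration under the assumptions of Proposition 4.2 with $\mu^\dagger = 0$ and odd $n$; the function name `second_derivative_energy` is hypothetical and the default values of `lam` and `sigma` are arbitrary choices.

```python
import math


def second_derivative_energy(n, p, lam=1.0, sigma=1.0):
    """Evaluate S(n) = (16 pi^4 sigma^2 / n) * sum_l l^4 / (1 + 16 pi^4 lam n^p l^4)^2.

    The sum runs over l = -(n-1)/2, ..., (n-1)/2. The l = 0 term vanishes and
    the summand is even in l, so only l >= 1 is summed and then doubled.
    """
    if n % 2 == 0:
        raise ValueError("this sketch only treats odd n")
    c = 16 * math.pi ** 4
    half = (n - 1) // 2
    positive = sum(
        l ** 4 / (1 + c * lam * n ** p * l ** 4) ** 2 for l in range(1, half + 1)
    )
    return c * sigma ** 2 * 2 * positive / n


if __name__ == "__main__":
    # Tabulate S(n) for exponents above, inside and below the window.
    for p in (0.5, 0.0, -0.8, -2.0):
        values = [second_derivative_energy(n, p) for n in (101, 1001, 10001)]
        print(p, ["%.3g" % v for v in values])
```

For exponents well below $-\frac{4}{5}$ the tabulated values grow with $n$, whereas for positive exponents they shrink, which is the qualitative behaviour described by statements 2 and 4.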

By the Poincaré inequality it follows that if $p \ge -\frac{4}{5}$ then the $H^2$ norm of our minimizer stays bounded as 
$n \to \infty$. Our final calculation in this section is to show that the regularization for $p \in [-\frac{4}{5}, 0]$ is not too strong. We 
have already shown that $\|\partial^2 \mu^n\|_{L^2}$ is bounded (in expectation) in this case but we wish to make sure that we don't 
have the stronger result that $\|\partial^2 \mu^n\|_{L^2} \to 0$. 

Proposition 4.3. With the assumptions of Proposition 4.1 and $a_n = \frac{1}{n}$, $b_n = \lambda n^p$ with $p \in [-\frac{4}{5}, 0]$ there exists a 
choice of $\mu^\dagger$ and a constant $M > 0$ such that if $\mu^n$ is the minimizer of $f_n^{(\omega)}$ then 

\[ \mathbb{E}\|\partial^2 \mu^n\|_{L^2}^2 \ge M. \tag{12} \]

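Proof. We only need to prove the proposition for $p = 0$ (the strongest regularization) and find one $\mu^\dagger$ such that (12) 
is true. Let $\mu^\dagger(t) = 2\cos(2\pi t) = e^{2\pi i t} + e^{-2\pi i t}$. Then the Fourier transform of $\mu^\dagger$ satisfies $\hat{\mu}^\dagger_l = 0$ for $l \ne \pm 1$ and 
$\hat{\mu}^\dagger_l = n$ for $l = \pm 1$. So, 

\[ \mathbb{E}\|\partial^2 \mu^n\|_{L^2}^2 = \frac{16\pi^4}{n^2} \sum_{l=-\frac{n-1}{2}}^{\frac{n-1}{2}} \frac{l^4 \, \mathbb{E}|\hat{z}_l|^2}{(1 + 16\pi^4 \lambda l^4)^2} \ge \frac{16\pi^4}{n^2} \sum_{l=-\frac{n-1}{2}}^{\frac{n-1}{2}} \frac{l^4 \, |\hat{\mu}^\dagger_l|^2}{(1 + 16\pi^4 \lambda l^4)^2} = \frac{32\pi^4}{(1 + 16\pi^4 \lambda)^2} > 0. \]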


□ 

We have shown that the minimizer is bounded for any $p \ge -\frac{4}{5}$ and $\|\partial^2 \mu^n\|_{L^2} \to 0$ for $p > 0$. The case $p > 0$ 
is clearly undesirable as we would be restricting ourselves to straight lines. The natural scaling for this problem 
is in the range $p \in [-\frac{4}{5}, 0]$. In the remainder of this paper we consider the case $p = 0$. This has the advantage 
that not only $\mathbb{E}\|\partial^2 \mu^n\|_{L^2}$ but also $\mathbb{E} f_n^{(\omega)}(\mu^n)$ is $O(1)$ as $n \to \infty$. In fact we will show that with this choice 
of regularization we do not need to choose $k$ dependent on the data generating model. The regularization makes 
the methodology sufficiently robust to have convergence even for poor choices of $k$. For example, if there exists a 
data generating process which is formed of a $k^\dagger$-mixture model then for our method to be robust does not require 
us to choose $k = k^\dagger$. Of course with the 'wrong' choice of $k$ the results may be physically meaningless and we 
should take care in how to interpret the results. The point to stress is that the methodology does not rely on a data 
generating model. 

The disadvantage of this is to potentially increase the bias in the method. Since the k-means is already biased 
we believe the advantages of our approach outweigh the disadvantages. In particular we have in mind applications 
where only a coarse estimate is needed. For example the k-means method may be used to initialize some other 
algorithm. Another application could be part of a decision making process: in Section 5.1 we show the k-means 
methodology can be used to determine whether two tracks have crossed. 

4.2 Convergence For General Y 

Let $(X, \|\cdot\|_X)$, $(Y, \|\cdot\|_Y)$ be reflexive, separable Banach spaces. We will also assume that the data points, $\Psi_n = 
\{\xi_i\}_{i=1}^n \subset X$ for $i = 1, 2, \dots, n$, are iid random elements with common law $P$. As before $\mu = (\mu_1, \mu_2, \dots, \mu_k)$ 
but now the cluster centers $\mu_j \in Y$ for each $j$. The cost function is $d : X \times Y \to [0, \infty)$. 

The energy functions associated with the k-means algorithm in this setting are slightly different to those used 
previously: 

\[ g_\mu : X \to \mathbb{R}, \qquad g_\mu(x) = \bigwedge_{j=1}^k d(x, \mu_j), \]

\[ f_n^{(\omega)} : Y^k \to \mathbb{R}, \qquad f_n^{(\omega)}(\mu) = P_n^{(\omega)} g_\mu + \lambda r(\mu), \tag{13} \]

\[ f_\infty : Y^k \to \mathbb{R}, \qquad f_\infty(\mu) = P g_\mu + \lambda r(\mu). \tag{14} \]
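
To make the roles of $d$, $r$ and $\lambda$ concrete, the following sketch evaluates $g_\mu$ and the empirical energy for user-supplied callables. It is an illustration only: the function names are hypothetical, and the toy choices in the usage lines ($X = Y = \mathbb{R}$, squared distance as cost, sum of squares as penalty) are assumptions made for the example, not the setting analysed in the text.

```python
def g(mu, x, d):
    """Cost of x against its closest center: the minimum over j of d(x, mu_j)."""
    return min(d(x, mu_j) for mu_j in mu)


def empirical_energy(mu, data, d, r, lam):
    """Empirical energy: the sample average of g plus lam * r(mu)."""
    return sum(g(mu, x, d) for x in data) / len(data) + lam * r(mu)


# Toy usage (assumed for illustration): real-valued data and centers.
def squared_distance(x, y):
    return (x - y) ** 2


def squared_norm(mu):
    return sum(m ** 2 for m in mu)


data = [0.0, 0.2, 9.8, 10.0]
energy = empirical_energy((0.0, 10.0), data, squared_distance, squared_norm, lam=0.01)
```

Keeping `d` and `r` as arguments mirrors the generality of this section: the data and the centers may live in different spaces as long as `d` can compare them.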

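The aim of this section is to show the convergence result: 

\[ \theta_n^{(\omega)} = \inf_{\mu \in Y^k} f_n^{(\omega)}(\mu) \to \inf_{\mu \in Y^k} f_\infty(\mu) = \theta \quad\text{as } n \to \infty \text{ for } \mathbb{P}\text{-almost every } \omega \]

and that minimizers converge (almost surely). 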

The key assumptions are given in Assumptions 2; they imply that $f_n^{(\omega)}$ is weakly lower semi-continuous and 
coercive. In particular, Assumption 2.2 allows us to prove the lim inf inequality as we did for Theorem 3.2. 
Assumption 2.1 is likely to mean that our convergence results are limited to the case of bounded noise. In fact, 
when applying the problem to the smoothing-data association problem, it is necessary to bound the noise in order 
for Assumption 2.5 to hold. Assumption 2.5 implies that $f_n^{(\omega)}$ is (uniformly) coercive and hence allows us to easily 
bound the set of minimizers. It is the subject of ongoing research to extend the convergence results to unbounded 
noise for the smoothing-data association problem. Assumption 2.3 is a measurability condition we require in 
order to integrate, and the weak lower semi-continuity of $r$ is needed to obtain the lim inf inequality in the 
$\Gamma$-convergence proof. 

We note that, since $P d(\cdot, \mu_1) \le \sup_{x \in \mathrm{supp}(P)} d(x, \mu_1) < \infty$, we have $f_\infty(\mu) < \infty$ for every $\mu \in Y^k$ (and 
since $r(\mu) < \infty$ for each $\mu \in Y^k$). 

Assumptions 2. We have the following assumptions on $d : X \times Y \to [0, \infty)$, $r : Y^k \to [0, \infty)$ and $P$. 

2.1. For all $y \in Y$ we have $\sup_{x \in \mathrm{supp}(P)} d(x, y) < \infty$ where $\mathrm{supp}(P) \subseteq X$ is the support of $P$. 

2.2. For each $x \in X$ and $y \in Y$ we have that if $x_m \to x$ and $y_n \rightharpoonup y$ as $n, m \to \infty$ then 

\[ \liminf_{n,m \to \infty} d(x_m, y_n) \ge d(x, y) \quad\text{and}\quad \lim_{m \to \infty} d(x_m, y) = d(x, y). \]

2.3. For every $y \in Y$ we have that $d(\cdot, y)$ is $\mathcal{X}$-measurable. 

2.4. $r$ is weakly lower semi-continuous. 

2.5. $r$ is coercive. 

We will follow the structure of Section 3. We start by showing that under the above conditions $f_n^{(\omega)}$, defined by (13), $\Gamma$-converges 
to $f_\infty$, defined by (14). We then show that the regularization term guarantees that the minimizers of $f_n^{(\omega)}$ lie in a bounded set. An 
application of Theorem 2.1 gives the desired convergence result. Since we were able to restrict our analysis to a 
weakly compact subset of $Y$ we are easily able to deduce the existence of a weakly convergent subsequence. 
Similarly to the previous section, on the product space $Y^k$ we use the norm $\|\mu\|_k := \max_j \|\mu_j\|_Y$. 

Theorem 4.4. Let $(X, \|\cdot\|_X)$ and $(Y, \|\cdot\|_Y)$ be separable and reflexive Banach spaces. Assume $r : Y^k \to [0, \infty)$, 
$d : X \times Y \to [0, \infty)$ and the probability measure $P$ on $(X, \mathcal{X})$ satisfy the conditions in Assumptions 2. For 
independent samples $\{\xi_i\}_{i=1}^n$ from $P$ define $P_n^{(\omega)}$ to be the empirical measure and $f_n^{(\omega)} : Y^k \to \mathbb{R}$ and $f_\infty : 
Y^k \to \mathbb{R}$ by (13) and (14) respectively and where $\lambda > 0$. Then 

\[ f_\infty = \Gamma\text{-}\lim_n f_n^{(\omega)} \]

for $\mathbb{P}$-almost every $\omega$. 

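Proof. Define 

\[ \Omega' = \left\{ \omega \in \Omega : P_n^{(\omega)} \Rightarrow P \right\} \cap \left\{ \omega \in \Omega : \xi_i(\omega) \in \mathrm{supp}(P) \ \forall i \in \mathbb{N} \right\}. \]

Then $\mathbb{P}(\Omega') = 1$. For the remainder of the proof we consider an arbitrary $\omega \in \Omega'$. We start with the lim inf 
inequality. Let $\mu^n \rightharpoonup \mu$; then 

\[ \liminf_{n \to \infty} f_n^{(\omega)}(\mu^n) \ge f_\infty(\mu) \]

follows (as in the proof of Theorem 3.2) by applying Theorem 1.1 in lfl5l and the fact that $r$ is weakly lower 
semi-continuous. 

We now establish the existence of a recovery sequence. Let $\mu \in Y^k$ and let $\mu^n = \mu$. We want to show 

\[ \lim_{n \to \infty} f_n^{(\omega)}(\mu) = \lim_{n \to \infty} P_n^{(\omega)} g_\mu + \lambda r(\mu) = P g_\mu + \lambda r(\mu) = f_\infty(\mu). \]

Clearly this is equivalent to showing that 

\[ \lim_{n \to \infty} P_n^{(\omega)} g_\mu = P g_\mu. \]

Now $g_\mu$ is continuous by assumption on $d$. Let $M = \sup_{x \in \mathrm{supp}(P)} d(x, \mu_1) < \infty$ and note that $g_\mu(x) \le M$ for 
all $x \in \mathrm{supp}(P)$, so $g_\mu$ is bounded there. Hence $P_n^{(\omega)} g_\mu \to P g_\mu$. □ 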

Proposition 4.5. Assuming the conditions of Theorem 4.4, for $\mathbb{P}$-almost every $\omega$ there exists $N < \infty$ and 
$R > 0$ such that 

\[ \min_{\mu \in Y^k} f_n^{(\omega)}(\mu) = \min_{\|\mu\|_k \le R} f_n^{(\omega)}(\mu) < \inf_{\|\mu\|_k > R} f_n^{(\omega)}(\mu) \quad \forall n \ge N. \]

In particular $R$ is independent of $n$. 


Proof. Let 

\[ \Omega'' = \left\{ \omega \in \Omega' : P_n^{(\omega)} \Rightarrow P \right\} \cap \left\{ \omega \in \Omega' : P_n^{(\omega)} d(\cdot, 0) \to P d(\cdot, 0) \right\}. \]

Then, for every $\omega \in \Omega''$, $f_n^{(\omega)}(0) \to f_\infty(0) < \infty$ where with a slight abuse of notation we denote the zero element 
in both $Y$ and $Y^k$ by 0. Take $N$ sufficiently large so that 

\[ f_n^{(\omega)}(0) \le f_\infty(0) + 1 \quad\text{for all } n \ge N. \]

Then $\min_{\mu \in Y^k} f_n^{(\omega)}(\mu) \le f_\infty(0) + 1$ for all $n \ge N$. By coercivity of $r$ there exists $R$ such that if $\|\mu\|_k > R$ then 
$\lambda r(\mu) > f_\infty(0) + 1$. Therefore any such $\mu$ is not a minimizer and in particular any minimizer must be contained 
in the set $\{\mu \in Y^k : \|\mu\|_k \le R\}$. □ 

The convergence results now follow by applying Theorem 4.4 and Proposition 4.5 to Theorem 2.1. 

Theorem 4.6. Assuming the conditions of Theorem 4.4 and Proposition 4.5, the minimization problem associated 
with the k-means method converges in the following sense: 

\[ \min_{\mu \in Y^k} f_\infty(\mu) = \lim_{n \to \infty} \min_{\mu \in Y^k} f_n^{(\omega)}(\mu) \]

for $\mathbb{P}$-almost every $\omega$. Furthermore any sequence of minimizers $\mu^n$ of $f_n^{(\omega)}$ is almost surely weakly precompact and 
any weak limit point minimizes $f_\infty$. 

It was not necessary to assume that cluster centers are in a common space. A trivial generalization would allow 
each $\mu_j \in Y^{(j)}$ with the cost and regularization terms appropriately defined; in this setting Theorem 4.6 holds. 

4.3 Application to the Smoothing-Data Association Problem 

In this section we give an application to the smoothing-data association problem and show the assumptions in the 
previous section are met. For k = 1 the smoothing-data association problem is the problem of fitting a curve to 
a data set (no data association). For k > 1 we couple the smoothing problem with a data association problem. 
Each data point is associated with an unknown member of a collection of k curves. Solving the problem involves 
simultaneously estimating both the data partition (i.e. the association of observations to curves) and the curve 
which best fits each subset of the data. By treating the curve of best fit as the cluster center we are able to approach 
this problem using the k-means methodology. The data points are points in space-time whilst cluster centers are 
functions from time to space. 

We let the Euclidean norm on $\mathbb{R}^\kappa$ be given by $|\cdot|$. Let $X = \mathbb{R} \times \mathbb{R}^\kappa$ be the data space. We will subsequently 
assume that the support of $P$, the common law of our observations, is contained within $\tilde{X} = [0, T] \times X'$ where 
$X' \subseteq [-N, N]^\kappa$. We define the cluster center space to be $Y = H^2([0, T])$, the Sobolev space of functions from 
$[0, T]$ to $\mathbb{R}^\kappa$. Clearly $X$ and $Y$ are separable and reflexive. The cost function $d : X \times Y \to [0, \infty)$ is defined by 

\[ d(\xi, \mu_j) = |z - \mu_j(t)|^2 \tag{15} \]

where $\mu_j \in Y$ and $\xi = (t, z) \in X$. We introduce a regularization term that penalizes the second derivative. This is 
a common choice in the smoothing literature, e.g. E3-. The regularization term $r : Y^k \to [0, \infty)$ is given by 

\[ r(\mu) = \sum_{j=1}^k \|\partial^2 \mu_j\|_{L^2}^2. \tag{16} \]

The k-means energy $f_n$ for data points $\{\xi_i = (t_i, z_i)\}_{i=1}^n$ is therefore written 

\[ f_n(\mu) = \frac{1}{n} \sum_{i=1}^n \bigwedge_{j=1}^k d(\xi_i, \mu_j) + \lambda r(\mu) = \frac{1}{n} \sum_{i=1}^n \bigwedge_{j=1}^k |z_i - \mu_j(t_i)|^2 + \lambda \sum_{j=1}^k \|\partial^2 \mu_j\|_{L^2}^2. \tag{17} \]
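
As an illustration of the energy (17), the sketch below evaluates it when each cluster center is restricted to a scalar-valued quadratic $\mu_j(t) = a + bt + ct^2$, for which the squared $L^2$ norm of the second derivative on $[0, T]$ equals $4c^2 T$ exactly. The restriction to quadratics, the scalar setting and the function names are assumptions made to keep the example self-contained; the text itself works in $H^2([0, T])$.

```python
def evaluate(center, t):
    """Evaluate the quadratic center (a, b, c) at time t."""
    a, b, c = center
    return a + b * t + c * t * t


def regularizer(centers, horizon):
    """Sum over centers of the squared L2 norm of the second derivative on [0, horizon]."""
    return sum(4.0 * c * c * horizon for (_, _, c) in centers)


def smoothing_energy(centers, data, lam, horizon):
    """Mean squared distance of each (t, z) to its closest curve, plus lam * regularizer."""
    fit = sum(min((z - evaluate(m, t)) ** 2 for m in centers) for t, z in data) / len(data)
    return fit + lam * regularizer(centers, horizon)
```

Straight lines have `c = 0` and therefore incur no penalty, so increasing `lam` only discourages curvature and never the level or slope of a center.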

In most cases it is reasonable to assume that any minimizer of $f_\infty$ must be uniformly bounded, i.e. there exists 
$N$ (which will in general depend on $P$) such that if $\mu^\infty$ minimizes $f_\infty$ then $|\mu^\infty(t)| \le N$ for all $t \in [0, T]$. Under 
this assumption we redefine $Y$ to be 

\[ Y = \left\{ \mu_j \in H^2([0, T]) : |\mu_j(t)| \le N \ \forall t \in [0, T] \right\}. \tag{18} \]

Since pointwise evaluation is a bounded linear functional in $H^s$ (for $s \ge 1$) this space is weakly closed. We 
now minimize $f_n$ over $Y^k$. Note that we are not immediately guaranteed that minimizers of $f_n$ over $(H^s)^k$ are 
contained in $Y^k$. However when we apply Theorem 4.6 we can conclude that minimizers $\mu^n$ of $f_n$ over $Y^k$ are 
weakly compact in $(H^s)^k$ and any limit point is a minimizer of $f_\infty$ in $Y^k$, and therefore any limit point is a 
minimizer of $f_\infty$ over $(H^s)^k$. 

If no such $N$ exists then our results in Theorem 4.6 are still valid; however, the minimum of $f_\infty$ over $(H^s)^k$ is 
not necessarily equal to the minimum of $f_\infty$ over $Y^k$. 

Our results show that the $\Gamma$-limit for $\mathbb{P}$-almost every $\omega$ is 

\[ f_\infty(\mu) = \int_X \bigwedge_{j=1}^k d(x, \mu_j) \, P(dx) + \lambda r(\mu) = \int_X \bigwedge_{j=1}^k |z - \mu_j(t)|^2 \, P(dx) + \lambda \sum_{j=1}^k \|\partial^2 \mu_j\|_{L^2}^2. \tag{19} \]

We start with the key result for this section, that is the existence of a weakly converging subsequence of minimizers. 
Our result relies upon the regularity of Sobolev functions. For our result to be meaningful we require that the 
minimizer should at least be continuous. In fact every $g \in H^2([0, T])$ is in $C^s([0, T])$ for any $s < \frac{3}{2}$. The 
regularity in the space allows us to further deduce the existence of a strongly converging subsequence. 

Theorem 4.7. Let $X = [0, T] \times \mathbb{R}^\kappa$ and define $Y$ by (18). Define $d : X \times Y \to [0, \infty)$ by (15) and $r : Y^k \to [0, \infty)$ 
by (16). For independent samples $\{\xi_i\}_{i=1}^n$ from $P$ which has compact support $\tilde{X} \subset X$ define $f_n, f_\infty : Y^k \to \mathbb{R}$ by 
(17) and (19) respectively. 

Then (1) any sequence of minimizers $\mu^n \in Y^k$ of $f_n$ is $\mathbb{P}$-almost surely weakly-precompact (in $H^2$) with any 
weak limit point of $\mu^n$ minimizing $f_\infty$ and (2) if $\mu^{n_m} \rightharpoonup \mu$ is a weakly converging (in $H^2$) subsequence of minimizers 
then the convergence is uniform (in $C^0$). 

To prove the first part of Theorem 4.7 we are required to check the boundedness and continuity assumptions on 
$d$ (Proposition 4.8) and show that $r$ is weakly lower semi-continuous and coercive (Proposition 4.9). This statement 
is then a straightforward application of Theorem 4.6. Note that we will have shown the result of Theorem 4.4 holds: 

\[ f_\infty = \Gamma\text{-}\lim_n f_n^{(\omega)}. \]

In what follows we check that properties hold for any $x \in \tilde{X}$, which should be understood as implying that they 
hold for $P$-almost any $x \in X$; this is sufficient for our purposes as the collection of sequences $\xi_1, \dots$ for which 
one or more observations lies in the complement of $\tilde{X}$ is $\mathbb{P}$-null and the support of $P_n$ is $\mathbb{P}$-almost surely contained 
within $\tilde{X}$. 

Proposition 4.8. Let $X = [0, T] \times [-N, N]^\kappa$ and define $Y$ by (18). Define $d : X \times Y \to [0, \infty)$ by (15). Then (i) 
for all $y \in Y$ we have $\sup_{x \in X} d(x, y) < \infty$ and (ii) for any $x \in X$ and $y \in Y$ and any sequences $x_m \to x$ and 
$y_n \rightharpoonup y$ as $m, n \to \infty$ we have $\liminf_{n,m \to \infty} d(x_m, y_n) = d(x, y)$. 

Proof. We start with (i). Let $y \in Y$ and $x = (t, z) \in [0, T] \times [-N, N]^\kappa$, then 

\[ d(x, y) = |z - y(t)|^2 \le 2|z|^2 + 2|y(t)|^2 \le 2N^2 + 2 \sup_{t \in [0,T]} |y(t)|^2. \]

Since $y$ is continuous, $\sup_{t \in [0,T]} |y(t)|^2 < \infty$ and moreover we can bound $d(x, y)$ independently of $x$, which 
shows (i). 

For (ii) we let $(t_m, z_m) = x_m \to x = (t, z)$ in $\mathbb{R}^{\kappa+1}$ and $y_n \rightharpoonup y$. Then 

\[ d(x_m, y_n) = |z_m - y_n(t_m)|^2 = |z_m|^2 - 2 z_m \cdot y_n(t_m) + |y_n(t_m)|^2. \tag{20} \]

Clearly $|z_m|^2 \to |z|^2$ and we now show that $y_n(t_m) \to y(t)$ as $m, n \to \infty$. 

We start by showing that the sequence $\|y_n\|_Y$ is bounded. Each $y_n$ can be associated with $\Lambda_n \in Y^{**}$ by 
$\Lambda_n(\nu) = \nu(y_n)$ for $\nu \in Y^*$. As $y_n$ is weakly convergent it is weakly bounded. So, 

\[ \sup_{n \in \mathbb{N}} |\Lambda_n(\nu)| = \sup_{n \in \mathbb{N}} |\nu(y_n)| \le M_\nu \]

for some $M_\nu < \infty$. By the uniform boundedness principle IfTOl 

\[ \sup_{n \in \mathbb{N}} \|\Lambda_n\|_{Y^{**}} < \infty. \]

And so, 

\[ \sup_{n \in \mathbb{N}} \|y_n\|_Y = \sup_{n \in \mathbb{N}} \|\Lambda_n\|_{Y^{**}} < \infty. \]

Hence there exists $M > 0$ such that $\|y_n\|_Y \le M$. Therefore 

\[ |y_n(r) - y_n(s)| = \left| \int_s^r \partial y_n(t) \, dt \right| \le \int_s^r |\partial y_n(t)| \, dt = \int_0^T \mathbb{I}_{[s,r]}(t) \, |\partial y_n(t)| \, dt \le \|\mathbb{I}_{[s,r]}\|_{L^2} \|\partial y_n\|_{L^2} \le M \sqrt{|r - s|}. \]

Since $y_n$ is uniformly bounded and equi-continuous, by the Arzelà-Ascoli theorem there exists a uniformly 
converging subsequence, say $y_{n_m} \to \hat{y}$. By uniqueness of the weak limit $\hat{y} = y$. But this implies that 

\[ y_n(t) \to y(t) \]

uniformly for $t \in [0, T]$. Now as 

\[ |y_n(t_m) - y(t)| \le |y_n(t_m) - y(t_m)| + |y(t_m) - y(t)| \]

then $y_n(t_m) \to y(t)$ as $m, n \to \infty$. Therefore the second and third terms of (20) satisfy 

\[ 2 z_m \cdot y_n(t_m) \to 2 z \cdot y(t), \qquad |y_n(t_m)|^2 \to |y(t)|^2 \]

as $m, n \to \infty$. Hence 

\[ d(x_m, y_n) \to |z|^2 - 2 z \cdot y(t) + |y(t)|^2 = |z - y(t)|^2 = d(x, y) \]

which completes the proof. 


□ 

Proposition 4.9. Define $Y$ by (18) and $r : Y^k \to [0, \infty)$ by (16). Then $r$ is weakly lower semi-continuous and 
coercive. 

Proof. We start by showing $r$ is weakly lower semi-continuous. For any weakly converging sequence $\mu_1^n \rightharpoonup \mu_1$ in 
$H^2$ we have that $\partial^2 \mu_1^n \rightharpoonup \partial^2 \mu_1$ weakly in $L^2$. Hence it follows that $r$ is weakly lower semi-continuous. 

To show $r$ is coercive let $\hat{r}(\mu_1) = \|\partial^2 \mu_1\|_{L^2}^2$ for $\mu_1 \in Y$. We will show $\hat{r}$ is coercive. Let $\mu_1 \in Y$ and note 
that since $\mu_1 \in C^1$ the first derivative exists (strongly). Clearly we have $\|\mu_1\|_{L^2} \le N\sqrt{T}$ and using a Poincaré 
inequality 

\[ \left\| \frac{d\mu_1}{dt} - \frac{1}{T} \int_0^T \frac{d\mu_1}{dt} \, dt \right\|_{L^2} \le C \|\partial^2 \mu_1\|_{L^2} \]

for some $C$ independent of $\mu_1$. Therefore 

\[ \left\| \frac{d\mu_1}{dt} \right\|_{L^2} \le C \|\partial^2 \mu_1\|_{L^2} + \left\| \frac{1}{T} \int_0^T \frac{d\mu_1}{dt} \, dt \right\|_{L^2} \le C \|\partial^2 \mu_1\|_{L^2} + \frac{2N}{\sqrt{T}}. \]

It follows that if $\|\mu_1\|_{H^2} \to \infty$ then $\|\partial^2 \mu_1\|_{L^2} \to \infty$, hence $\hat{r}$ is coercive. □ 

Finally, the existence of a strongly convergent subsequence in Theorem 4.7 follows from the fact that $H^2$ is 
compactly embedded into $H^1$. Hence the convergence is strong in $H^1$. By Morrey's inequality $H^1$ is embedded 
into a Hölder space ($C^{0,\frac{1}{2}}$) which is a subset of uniformly continuous functions. This implies the convergence is 
uniform in $C^0$. 


5 Examples 

In this section we give two exemplar applications of the methodology. In principle any cost function, d, and 
regularization, r, (that satisfy the conditions) could be used. For illustrative purposes we choose d and r to make 
the minimization simple to implement. In particular, in Example 1 our choices allow us to use smoothing splines. 

5.1 Example 1: A Smoothing-Data Association Problem 

We use the k-means method to solve a smoothing-data association problem. For each $j = 1, 2, \dots, k$ we take 
functions $x^j : [0, T] \to \mathbb{R}$ as the "true" cluster centers, and for sample times $t_i^j$ for $i = 1, 2, \dots, n_j$, 
uniformly distributed over $[0, T]$, we let 

\[ z_i^j = x^j(t_i^j) + \epsilon_i^j \tag{21} \]

where $\epsilon_i^j$ are iid noise terms. 

The observations take the form $\xi_i = (t_i, z_i)$ for $i = 1, 2, \dots, n = \sum_{j=1}^k n_j$ where we have relabeled the 
observations to remove the (unobserved) target reference. We model the observations with density (with respect to 
the Lebesgue measure) 

\[ p((t, z)) = \frac{1}{T} \mathbb{I}_{[0,T]}(t) \sum_{j=1}^k w_j \, p_\epsilon(z - x^j(t)) \]

on $\mathbb{R} \times \mathbb{R}$ where $p_\epsilon$ denotes the common density of the $\epsilon_i^j$ and $w_j$ denotes the probability that an observation is 
generated by trajectory $j$. We let each cluster center be equally weighted: $w_j = \frac{1}{k}$. The cluster centers were fixed 
and in particular did not vary between numerical experiments. 

When the noise is bounded this is precisely the problem described in Section 4.3 with $\kappa = 1$, hence the problem 
converges. We use a truncated Gaussian noise term. 

In the theoretical analysis of the algorithm we have considered only the minimization problem associated with 
the k-means algorithm; of course minimizing complex functionals of the form of $f_n$ is itself a challenging problem. 
Practically, we adopt the usual k-means strategy [23] of iteratively assigning data to the closest of a collection of 
$k$ centers and then re-estimating each center by finding the center which minimizes the average regularized cost of 
the observations currently associated with that center. As the energy function is bounded below and monotonically 
decreasing over iterations, this algorithm converges to a local (but not necessarily global) minimum. 

Figure 1: Smoothed data association trajectory results for the k-means method. 
The figure on the left shows the raw data with the data generating model. That on the right shows the output of the k-means 
algorithm. The parameters used are: $k = 3$, $T = 10$, $\epsilon_j$ from a $N(0, 5)$ truncated at $\pm 100$, $\lambda = 1$, $x^1(t) = -15 - 2t + 0.2t^2$, 
$x^2(t) = 5 + t$ and $x^3(t) = 40$. 

More precisely, in the particular example considered here we employ the following iterative procedure: 

1. Initialize $\varphi^0 : \{1, 2, \dots, n\} \to \{1, 2, \dots, k\}$ arbitrarily. 


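2. For a given data partition $\varphi^r : \{1, 2, \dots, n\} \to \{1, 2, \dots, k\}$ we independently find the cluster centers 
$\mu^r = (\mu_1^r, \mu_2^r, \dots, \mu_k^r)$ where each $\mu_j^r \in H^2([0, T])$ by 

\[ \mu_j^r = \operatorname{argmin}_{\mu_j} \frac{1}{n} \sum_{i : \varphi^r(i) = j} |z_i - \mu_j(t_i)|^2 + \lambda \|\partial^2 \mu_j\|_{L^2}^2 \quad\text{for } j = 1, 2, \dots, k. \]

This is done using smoothing splines. 

3. Data is repartitioned using the cluster centers $\mu^r$: 

\[ \varphi^{r+1}(i) = \operatorname{argmin}_{j = 1, 2, \dots, k} |z_i - \mu_j^r(t_i)|. \]

4. If $\varphi^{r+1} \ne \varphi^r$ then return to Step 2. Else we terminate. 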
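
The four steps above can be sketched as follows. This is a simplified stand-in rather than the procedure used for the experiments: the smoothing-spline fit in Step 2 is replaced by an ordinary least-squares straight line (the kind of center that very strong regularization would favour), the data are scalar-valued, and the names `fit_line` and `alternate` are hypothetical.

```python
def fit_line(points):
    """Least-squares line z = a + b * t through (t, z) pairs; returns (a, b)."""
    m = len(points)
    t_bar = sum(t for t, _ in points) / m
    z_bar = sum(z for _, z in points) / m
    s_tt = sum((t - t_bar) ** 2 for t, _ in points)
    s_tz = sum((t - t_bar) * (z - z_bar) for t, z in points)
    slope = s_tz / s_tt if s_tt > 0 else 0.0  # constant fit if all times coincide
    return (z_bar - slope * t_bar, slope)


def alternate(data, k, assignment, max_iter=100):
    """Alternate between refitting centers (Step 2) and reassigning data (Step 3).

    data is a list of (t, z) pairs and assignment a list of labels in range(k).
    Stops when the assignment no longer changes (Step 4) or after max_iter sweeps.
    """
    assignment = list(assignment)
    centers = [(0.0, 0.0)] * k
    for _ in range(max_iter):
        # Step 2: refit each center to the points currently assigned to it.
        for j in range(k):
            members = [pt for pt, label in zip(data, assignment) if label == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = fit_line(members)
        # Step 3: reassign every point to the closest fitted line.
        updated = [
            min(range(k), key=lambda j: abs(z - (centers[j][0] + centers[j][1] * t)))
            for t, z in data
        ]
        # Step 4: terminate once the partition is unchanged.
        if updated == assignment:
            break
        assignment = updated
    return centers, assignment
```

As with the spline-based version, the result depends on the initial partition, so in practice several initializations would be compared by their final energy.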

Let $\mu^n = (\mu_1^n, \dots, \mu_k^n)$ be the output of the k-means algorithm from $n$ data points. To evaluate the success of 
the methodology when dealing with a finite sample of $n$ data points we look at how many iterations are required to 
reach convergence (defined as an assignment which is unchanged over the course of an algorithmic iteration), the 
number of data points correctly associated, the metric 

\[ \eta(n) = \frac{1}{k} \sqrt{ \sum_{j=1}^k \left\| \mu_j^n - x^j \right\|_{L^2}^2 } \]

and the energy 

\[ \theta_n = f_n(\mu^n) \]

where 

\[ f_n(\mu) = \frac{1}{n} \sum_{i=1}^n \bigwedge_{j=1}^k |z_i - \mu_j(t_i)|^2 + \lambda \sum_{j=1}^k \|\partial^2 \mu_j\|_{L^2}^2. \]

Figure 1 shows the raw data and output of the k-means algorithm for one realization of the model. We run 
Monte Carlo trials for increasing numbers of data points; in particular we run $10^3$ numerical trials independently 
for each $n = 300, 600, \dots, 3000$ where we generate the data from (21) and cluster using the above algorithm. Each 
numerical experiment is independent. 

Results, shown in Figure 2, illustrate that as measured by $\eta$ the performance of the k-means method improves 
with the size of the available data set, as does the proportion of data points correctly assigned. The minimum energy 
stabilizes as the size of the data set increases, although the algorithm does take more iterations to 
converge. We also note that the energy of the data generating functions is higher than the minimum energy. 

Since the iterative k-means algorithm described above does not necessarily identify global minima, we tested 
the algorithm on two targets whose paths intersect as shown in Figure 3. The data association hypotheses 
corresponding to correct and incorrect associations, after the crossing point, correspond to two local minima. The 
observation window $[0, T]$ was expanded to investigate the convergence to the correct data association hypothesis. 


Figure 2: Monte Carlo convergence results. 

Convergence results for the parameters given in Figure 1. In (a) the thick dotted line corresponds to the median number of 
iterations taken for the method to converge and the thinner dotted lines are the 25% and 75% quantiles. The thick solid line 
corresponds to the median percentage of data points correctly identified and the thinner solid lines are the 25% and 75% quantiles. 
(b) shows the median value of $\eta(n)$ (solid), interquartile range (box) and the interval between the 5% and 95% percentiles 
(whiskers). (c) shows the mean minimum energy $\theta_n$ (solid) and the 10% and 90% quantiles (dashed). The energy associated 
with the data generating model is also shown (long dashes). In order to increase the chance of finding a global minimum, for each 
Monte Carlo trial ten different initializations were tried and the one that had the smallest energy on termination was recorded. 


Figure 3: Crossing tracks in the k-means method. 

Typical data sets for times up to $T_{\max}$ with cluster centers, fitted up till $T$, exhibiting crossing and non-crossing behavior. The 
parameters used are $k = 2$, $T_{\min} = 9.6 \le T \le 11 = T_{\max}$, $\epsilon_j \sim N(0, 5)$, $x^1(t) = -20 + t^2$ and $x^2(t) = 20 + 4t$. There are 
$n = 220$ data points uniformly distributed over $[0, 11]$ with 110 observations for each track. The crossing occurs at approximately 
$t \approx 8.6$ but we wait a further time unit before investigating the decision making procedure. 


To enable this to be described in more detail we introduce the crossing and non-crossing energies: 

Ec = ^/ra(ftc) 

-E'nc = nc) 

where $\mu_c$ and $\mu_{nc}$ are the k-means centers for the crossing (correct) and non-crossing (incorrect) solutions. To allow 
the association performance to be quantified, we therefore define the relative energy 

\[ \Delta E = E_c - E_{nc}. \]

To determine how many numerical trials we should run in order to get a good number of simulations that 
produce crossing and non-crossing outputs we first ran the experiment until we achieved at least 100 tracks that 
crossed and at least 100 that did not. That is, let $N_t^c$ be the number of trials that output tracks that crossed and $N_t^{nc}$ be the 
number of trials that output tracks that did not cross. We stop when $\min\{N_t^c, N_t^{nc}\} \ge 100$. Let $N_t = 10(N_t^c + N_t^{nc})$. 
We then re-ran the experiment with $N_t$ trials so we expect that we get 1000 tracks that do not cross and 1000 tracks 
that do cross at each time $t$. 

The results in Figure 4 show that initially the better solution to the k-means minimization problem is the one 
that incorrectly partitions the tracks after the intersection. However, as time is run forward the k-means favors the 
partition that correctly associates tracks to targets. This is reflected in both an increase in $\Delta E$ and the percentage of 
outputs that correctly identify the switch. Our results show that for $T > 9.7$ the energy difference between the two 
minima grows linearly with time. However, when we look at which minimum the k-means algorithm finds, our results 
suggest that after time $T \approx 10.25$ the probability of finding the correct minimum stabilizes at approximately 64%. 
There is reasonably large variance in the energy difference. The mean plus standard deviation is positive for all $T$ 
greater than 9.8; however, it takes until $T = 10.8$ for the average energy difference to be positive. 

Figure 4: Energy differences in the k-means method. 

Mean results are shown for data obtained using the parameters given in Figure 3 for data up to time $T$ (between $T_{\min}$ and $T_{\max}$). 
The thick solid line shows the mean $\Delta E$ and the thinner lines one standard deviation either side of the mean. The dashed line 
shows the percentage of times we correctly identified the tracks as crossing. 

5.2 Example 2: Passive Electromagnetic Source Tracking 

In the previous example the data is simply a linear projection of the trajectories. In contrast, here we consider 
the more general case where the measurement space $X$ and the model space $Y$ are very different, being connected by a 
complicated mapping that results in a very non-linear cost function $d$. While the increased complexity of the cost 
function does lead to a (linear in data size) increase in computational cost, the problem is equally amenable to our 
approach. 

In this example we consider the tracking of targets that periodically emit radio pulses as they travel on a two 
dimensional surface. These emissions are detected by an array of (three) sensors that characterize the detected 
emissions in terms of ‘time of arrival’, ‘signal amplitude’ and the ‘identity of the sensor making the detection’. 

Expressed in this way, the problem has a structure which does not fall directly within the framework which the 
theoretical results of previous sections cover. In particular, the observations are not independent (we have exactly 
one from each target in each measurement interval), they are not identically distributed and they do not admit an 
empirical measure which is weakly convergent in the large data limit. 

This formulation could be refined so that the problem did fall precisely within the framework, but only at the
expense of losing physical clarity. This is not done but, as shall be seen below, good performance is obtained even in
the current formulation. This gives some confidence in k-means-like strategies in general settings, at least
when the qualitatively important features of the problem are close to those considered theoretically, and gives some
heuristic justification for the lack of rigor.

Three sensors receive amplitude and time of arrival from each target with periodicity $\tau$. Data at each sensor are
points in $\mathbb{R}^2$ whilst the cluster centers (trajectories) are time-parameterized curves in a different $\mathbb{R}^2$ space.

In the generating model, for clarity we again index the targets in the observed amplitude and time of arrival. 
However, we again assume that this identifier is not observed and this notation is redefined (identities suppressed) 
when we apply the k-means method.

Let $x_j(t) \in \mathbb{R}^2$ be the position of target $j$ for $j = 1, 2, \dots, k$ at time $t \in [0, T]$. In every time frame of length $\tau$
each target emits a signal which is detected at three sensors. The time difference from the start of the time frame
to when the target emits this signal is called the time offset. The time offset for each target is a constant which we
call $o_j$ for $j = 1, 2, \dots, k$. Target $j$ therefore emits a signal at times

\[
t_j(m) = m\tau + o_j
\]

for $m \in \mathbb{N}$ such that $t_j(m) \le T$. Note that this is not the time of arrival and we do not observe $t_j(m)$.

Sensor $p$ at position $z_p$ detects this signal some time later and measures the time of arrival $t_j^p(m) \in [0, T]$ and
amplitude $a_j^p(m) \in \mathbb{R}$ from target $j$. The time of arrival is

\[
t_j^p(m) = m\tau + o_j + \frac{|x_j(m) - z_p|}{c} + \epsilon_j^p(m) = t_j(m) + \frac{|x_j(m) - z_p|}{c} + \epsilon_j^p(m)
\]

where $c$ is the speed of the signal and $\epsilon_j^p(m)$ are iid noise terms with variance $\sigma^2$. The amplitude is


\[
a_j^p(m) = \log\left( \frac{\alpha}{|x_j(m) - z_p|^2 + \beta} \right) + \delta_j^p(m)
\]


where $\alpha$ and $\beta$ are constants and $\delta_j^p(m)$ are iid noise terms with variance $\nu^2$. We assume the parameters $\alpha$, $\beta$, $c$, $\sigma$,
$\tau$, $\nu$ and $z_p$ are known.
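
The generating model above can be simulated directly. The following Python sketch is illustrative only and is not the authors' code: it assumes straight-line trajectories evaluated at the frame start $m\tau$, Gaussian noise, the logarithmic amplitude law that also appears in the function $\psi$ of this section, and zero-based sensor indices. The function name `simulate`, the velocities and the shortened horizon in the usage lines are hypothetical choices, not values from the paper.

```python
import math
import random


def simulate(x0, v, offsets, sensors, tau, T, c, alpha, beta, sigma, nu, seed=0):
    """Return detections (t, a, p, j); j is the hidden target label, kept only for checking."""
    rng = random.Random(seed)
    data = []
    for j, (xj0, vj, oj) in enumerate(zip(x0, v, offsets)):
        m = 0
        while m * tau + oj <= T:  # emission times t_j(m) = m*tau + o_j up to T
            # Assumption: the position in frame m is x_j(0) + m*tau*v_j.
            pos = (xj0[0] + m * tau * vj[0], xj0[1] + m * tau * vj[1])
            for p, z in enumerate(sensors):
                r = math.hypot(pos[0] - z[0], pos[1] - z[1])
                t = m * tau + oj + r / c + rng.gauss(0.0, sigma)           # time of arrival
                a = math.log(alpha / (r * r + beta)) + rng.gauss(0.0, nu)  # amplitude
                data.append((t, a, p, j))
            m += 1
    rng.shuffle(data)  # the ordering must not reveal which target emitted what
    return data


# Hypothetical usage: two targets, three sensors, 100 time frames.
params = dict(
    x0=[(0.0, 5.0), (6.0, 7.0)], v=[(0.01, 0.01), (-0.01, 0.0)], offsets=[0.3, 0.6],
    sensors=[(-10.0, -10.0), (10.0, -10.0), (0.0, 10.0)],
    tau=1.0, T=100.0, c=100.0, alpha=1e8, beta=5.0, sigma=0.03, nu=0.05,
)
data = simulate(**params)
```

Each target contributes one detection per sensor per frame, which is the dependence structure noted earlier as falling outside the iid setting of the theory.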

To simplify the notation, $\Pi_q x$ denotes the projection of $x \in \mathbb{R}^2$ onto its $q$th coordinate, so that $\Pi_q : \mathbb{R}^2 \to \mathbb{R}$ for $q = 1, 2$;
i.e. the position of target $j$ at time $t$ can be written $x_j(t) = (\Pi_1 x_j(t), \Pi_2 x_j(t))$.

In practice we do not know to which target each observation corresponds. We use the k-means method to
partition a set $\{\xi_i = (t_i, a_i, p_i)\}_{i=1}^n$ into the $k$ targets. Note the relabeling of indices; $\xi_i = (t_i, a_i, p_i)$ is the time of
arrival $t_i$, amplitude $a_i$ and sensor $p_i$ of the $i$th detection. The cluster centers are in a function-parameter product
space, $\mu_j = (x_j(t), o_j) \in C^0([0, T]; \mathbb{R}^2) \times [0, \tau) \subset C^0([0, T]; \mathbb{R}^2) \times \mathbb{R}$, which estimates the $j$th target's trajectory and
time offset. The k-means minimization problem is

\[
\mu^n = \operatorname*{argmin}_{\mu \in (C^0 \times [0, \tau))^k} \frac{1}{n} \sum_{i=1}^n \bigwedge_{j=1}^k d(\xi_i, \mu_j)
\]

for a choice of cost function $d$. If we look for cluster centers as straight trajectories then we can restrict ourselves
to functions of the form $x_j(t) = x_j(0) + v_j t$ and consider the cluster centers as finite dimensional objects. This
allows us to redefine our minimization problem as

\[
\mu^n = \operatorname*{argmin}_{\mu \in (\mathbb{R}^4 \times [0, \tau))^k} \frac{1}{n} \sum_{i=1}^n \bigwedge_{j=1}^k d(\xi_i, \mu_j)
\]


so that now $\mu_j = (x_j(0), v_j, o_j) \in \mathbb{R}^2 \times \mathbb{R}^2 \times [0, \tau)$. We note that in this finite dimensional formulation it is
not necessary to include a regularization term; a feature already anticipated in the definition of the minimization
problem.

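For $\mu_j = (x_j, v_j, o_j)$ we define the cost function

\[
d((t, a, p), \mu_j) = \big( (t, a) - \psi(\mu_j, p, m) \big)
\begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{\nu^2} \end{pmatrix}
\big( (t, a) - \psi(\mu_j, p, m) \big)^\top
\]

where $m = \max\{n \in \mathbb{N} : n\tau \le t\}$,

\[
\psi(\mu_j, p, m) = \left( \frac{|x_j + m\tau v_j - z_p|}{c} + o_j + m\tau, \; \log\left( \frac{\alpha}{|x_j + m\tau v_j - z_p|^2 + \beta} \right) \right)
\]

and superscript $\top$ denotes the transpose.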
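
For straight-line centers the cost is a weighted squared residual between an observation and the noise-free prediction $\psi$. The sketch below is a hedged illustration of that computation: the diagonal weights $1/\sigma^2$ and $1/\nu^2$ (inverse noise variances) are an assumption of this sketch, the frame index is taken as $\lfloor t/\tau \rfloor$, sensors are indexed from zero, and the names `psi` and `cost` as well as the center values are hypothetical.

```python
import math


def psi(mu, z, m, tau, c, alpha, beta):
    """Noise-free (arrival time, amplitude) predicted at sensor position z in frame m."""
    (x1, x2), (v1, v2), o = mu
    r = math.hypot(x1 + m * tau * v1 - z[0], x2 + m * tau * v2 - z[1])
    return (r / c + o + m * tau, math.log(alpha / (r * r + beta)))


def cost(obs, mu, sensors, tau, c, alpha, beta, sigma, nu):
    """Weighted squared residual d((t, a, p), mu); the weights 1/sigma^2, 1/nu^2 are assumed."""
    t, a, p = obs
    m = math.floor(t / tau)  # largest integer n with n*tau <= t
    pred_t, pred_a = psi(mu, sensors[p], m, tau, c, alpha, beta)
    return ((t - pred_t) / sigma) ** 2 + ((a - pred_a) / nu) ** 2


# Hypothetical usage: a noise-free detection of one center, scored against two centers.
sensors = [(-10.0, -10.0), (10.0, -10.0), (0.0, 10.0)]
par = dict(tau=1.0, c=100.0, alpha=1e8, beta=5.0)
mu_true = ((0.0, 5.0), (0.01, 0.01), 0.3)    # (x_j(0), v_j, o_j)
mu_other = ((6.0, 7.0), (-0.01, 0.0), 0.6)
t, a = psi(mu_true, sensors[0], 3, **par)
d_true = cost((t, a, 0), mu_true, sensors, sigma=0.03, nu=0.05, **par)
d_other = cost((t, a, 0), mu_other, sensors, sigma=0.03, nu=0.05, **par)
```

A noise-free detection has essentially zero cost under the center that generated it and a much larger cost under the other center, which is what drives the assignment step.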

We initialize the partitions by uniformly randomly choosing $\varphi^0 : \{1, 2, \dots, n\} \to \{1, 2, \dots, k\}$. At the $r$th
iteration the k-means minimization problem is then partitioned into $k$ independent problems

\[
\mu_j^r = \operatorname*{argmin}_{\mu_j} \sum_{i \,:\, \varphi^{r-1}(i) = j} d((t_i, a_i, p_i), \mu_j) \quad \text{for } 1 \le j \le k.
\]


A range of initializations for $\mu_j$ is used to increase the chance of the method converging to a global minimum.
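
The per-cluster fit is a non-convex problem, which is why several starting points are tried. The text does not say which optimizer is used, so the sketch below substitutes a simple derivative-free compass search and illustrates the multi-start idea on a one-dimensional double-well function standing in for the cluster energy. The names `pattern_search`, `multistart` and `energy` are hypothetical.

```python
def pattern_search(f, x, step=1.0, tol=1e-6, max_iter=10000):
    """Compass search: try +/- step along each coordinate, halving the step when stuck."""
    x = list(x)
    fx = f(x)
    for _ in range(max_iter):
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                y = x[:]
                y[i] += delta
                fy = f(y)
                if fy < fx:
                    x, fx, improved = y, fy, True
                    break
        if not improved:
            step *= 0.5
            if step < tol:
                break
    return x, fx


def multistart(f, starts, **kwargs):
    """Run the local search from every start and keep the lowest-energy result."""
    return min((pattern_search(f, s, **kwargs) for s in starts), key=lambda res: res[1])


def energy(x):
    """Stand-in energy with a local minimum near x = 0.96 and a global one near x = -1.04."""
    return (x[0] ** 2 - 1.0) ** 2 + 0.3 * x[0]


single_x, single_f = pattern_search(energy, [2.0])          # one start: stuck in the local well
best_x, best_f = multistart(energy, [[2.0], [-2.0]])        # two starts: keeps the lower energy
```

A single start from the right-hand side settles in the shallower well, whereas keeping the best of several starts recovers the deeper one; more starts raise the chance of this but never guarantee it.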

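For optimal centers $\mu_j^r$ conditioned on partition $\varphi^{r-1}$ we can define the partition $\varphi^r$ to be the optimal partition of
$\{(t_i, a_i, p_i)\}_{i=1}^n$ conditioned on centers $(\mu_j^r)$ by solving

\[
\varphi^r : \{1, 2, \dots, n\} \to \{1, 2, \dots, k\}, \qquad i \mapsto \operatorname*{argmin}_{j = 1, 2, \dots, k} d((t_i, a_i, p_i), \mu_j^r).
\]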


The method has converged when $\mu^r = \mu^{r-1}$ for some $r$. Typical simulated data and resulting trajectories are
shown in Figure 5.
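
The two conditional minimizations alternate until nothing changes. A minimal sketch of that loop follows, with the cost and the per-cluster fit passed in as functions. It is an illustration under stated assumptions rather than the authors' implementation: the loop stops once the partition repeats (after which a deterministic center fit cannot change either), it assumes no cluster becomes empty, and the usage lines substitute scalar data, squared distance and the sample mean for the tracking cost and its nonlinear fit.

```python
def alternate(data, k, cost, fit_center, phi0, max_iter=100):
    """Alternate center fits and reassignments; return (centers, partition, iterations)."""
    phi = list(phi0)
    centers = None
    for r in range(1, max_iter + 1):
        # Center step: fit each center to the points currently assigned to it.
        # Assumption: every cluster keeps at least one point.
        centers = [fit_center([x for x, lab in zip(data, phi) if lab == j]) for j in range(k)]
        # Assignment step: send each point to the center with the smallest cost.
        new_phi = [min(range(k), key=lambda j: cost(x, centers[j])) for x in data]
        if new_phi == phi:  # partition unchanged, so the next center fit would repeat
            return centers, phi, r
        phi = new_phi
    return centers, phi, max_iter


def sq_dist(x, c):
    """Stand-in cost: squared distance on the real line."""
    return (x - c) ** 2


def mean(xs):
    """Stand-in center fit: the sample mean minimizes the summed squared distance."""
    return sum(xs) / len(xs)


# Hypothetical usage on two well-separated groups with a deliberately poor initial partition.
points = [0.0, 0.2, 0.1, 5.0, 5.2, 5.1]
centers, phi, iterations = alternate(points, 2, sq_dist, mean, phi0=[0, 1, 0, 1, 0, 1])
```

For the tracking problem, `cost` would be the function $d$ and `fit_center` a nonlinear least-squares fit over $(x_j(0), v_j, o_j)$, run from several initializations.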

Figure 5: Representative data and resulting tracks for the passive tracking example.
Representative data is shown for the parameters $k = 2$, $\tau = 1$, $T = 1000$, $c = 100$, $z_1 = (-10, -10)$, $z_2 = (10, -10)$,
$z_3 = (0, 10)$, $\epsilon_j^p(m) \sim N(0, 0.03^2)$, $\delta_j^p(m) \sim N(0, 0.05^2)$, $\alpha = 10^8$, $\beta = 5$, $x_1(t) = [\ldots](1, 1) + (0, 5)$,
$x_2(t) = (6, 7) - [\ldots](1, 0)$, $o_1 = 0.3$ and $o_2 = 0.6$, given the sensor configuration shown at the top of the figure. The
k-means method was run until it converged, with the trajectory component of the resulting cluster centers plotted with
the true trajectories at the top of the figure. Target one is the dashed line with starred data points, target two is
the solid line and square data points.

To illustrate the convergence result achieved above we performed a test on a set of data simulated from the same
model as in Figure 5. We sample $n_s$ observations from $\{(t_i, a_i, p_i)\}_{i=1}^n$ and compare our results as $n_s \to n$. Let
$x^{n_s}(t) = (x_1^{n_s}(t), \dots, x_k^{n_s}(t))$ be the position output by the k-means method described above using $n_s$ data points
and $x(t) = (x_1(t), \dots, x_k(t))$ be the true values of each cluster center. We use the metric


\[
\eta(n_s) = \frac{1}{k} \sqrt{ \sum_{j=1}^k \| x_j^{n_s} - x_j \|_{L^2}^2 }
\]


to measure how close the estimated position is to the exact position. Note that we do not use the estimated time offset
given by the first model. The number of iterations required for the method to converge is also recorded. Results are
shown in Figure 6.
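
When the centers are straight lines, the $L^2$ distance between an estimated and a true track has a closed form, since $\int_0^T |\Delta x + \Delta v \, t|^2 \, dt = T|\Delta x|^2 + T^2 \, \Delta x \cdot \Delta v + \frac{T^3}{3} |\Delta v|^2$. The sketch below evaluates the error this way. It assumes that estimated and true tracks are listed in matching order and that the error is normalised as $1/k$ times the square root of the summed squared norms; the function names and track values are hypothetical.

```python
import math


def l2_sq_line_gap(est, true, T):
    """Closed-form integral over [0, T] of |(x0 + v t) - (y0 + w t)|^2 for planar lines."""
    (x0, v), (y0, w) = est, true
    dx = (x0[0] - y0[0], x0[1] - y0[1])
    dv = (v[0] - w[0], v[1] - w[1])
    return (T * (dx[0] ** 2 + dx[1] ** 2)
            + T ** 2 * (dx[0] * dv[0] + dx[1] * dv[1])
            + T ** 3 / 3.0 * (dv[0] ** 2 + dv[1] ** 2))


def eta(est_tracks, true_tracks, T):
    """Assumed normalisation: (1/k) * sqrt(sum over j of the squared L2 gap of track j)."""
    k = len(true_tracks)
    total = sum(l2_sq_line_gap(e, t, T) for e, t in zip(est_tracks, true_tracks))
    return math.sqrt(total) / k


# Hypothetical usage: each track is (initial position, velocity).
T = 10.0
true_tracks = [((0.0, 5.0), (0.01, 0.012)), ((6.0, 7.0), (-0.01, 0.0))]
est_tracks = [((0.1, 5.0), (0.01, 0.01)), ((6.0, 7.0), (-0.01, 0.0))]
err = eta(est_tracks, true_tracks, T)
```

Because the velocity error enters with weight $T^3/3$, small slope errors dominate the metric over long horizons.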

In this example the data has enough separation that we are always able to recover the true data partition. We
also see improvement in our estimated cluster centers and convergence of the minimum energy as we increase the
size of the data. Finding global minima is difficult and, although we run the k-means method from multiple starting
points, we sometimes only find local minima. For $n_s/n = 0.3$ we see the effect of finding local minima. In this
case only one Monte Carlo trial produces a bad result, but the error $\eta$ is so great (around 28 times greater than the
average) that it can be seen in the mean result shown in Figure 6(c).


Figure 6: Monte Carlo convergence results.
[Figure 6 panels (a) to (c); horizontal axis $n_s/n$ in each.]
Convergence results for $10^3$ Monte Carlo trials with the parameters given in Figure 5, expressed with the notation used in
Figure 2. In (a) we have also recorded the mean number of iterations to converge (long dashes). The 25% and 75% quantiles
for the number of iterations to converge are 2 and 4 respectively for all $n$. The 25% and 75% quantiles for the percentage
of data points correctly identified are 100% in both cases for all $n$. This is due to large separation in the data space. To
increase the chance of finding a global minimum for each Monte Carlo trial, out of five different initializations, that
which had the smallest energy on terminating was recorded.